SUMMARY 


The present report, number ISR-3, contains a record of the 
continuing investigation of various techniques for automatic content 
analysis, and for the storage and search of structured information. 
In particular, some storage allocation techniques are examined which 
are useful in the manipulation of natural language data; a variety 
of machine programs and experiments are then described for the 
processing of bibliographic citations; and two models for a completely 
automatic document retrieval system are outlined. 

Section I by G. Salton outlines two automatic document retrieval 
systems based on the usual word frequency counting procedures, 
supplemented by a number of auxiliary aids which replace a complete semantic 
analysis of the language. The principal aids consist of a hierarchical 
storage arrangement between certain subject categories, a set of cross 
references between related subjects, and sets of related or synonymous 
words attached to certain subject categories. Some basic retrieval 
procedures which make use of these structures are outlined. 

Section II by M. Thompson includes a detailed description of 
a program designed to match bibliographic citations. The program is 
divided into two parts: the first part normalizes the format of each 
citation, and the second part performs the actual comparison between 
citations. This program, when completed, should be useful for the 
automatic construction of citation indexes, and for the manipulation 
of bibliographic data in general. 

Section III by E. H. Sussenguth, Jr., contains a description 
and evaluation of a chained tree structure useful for the storage of 
natural language data. The proposed storage organization permits a 
search procedure which is almost as efficient as that provided by a 
binary search, and yet is easily adapted to file changes, such as 
additions and deletions. 

Sections IV, V, and VI by M. Lesk cover experiments performed 
with bibliographic citations for purposes of content analysis, as 
originally described in Section III of report ISR-2. Section IV is 
a description of the basic interpretive program used to perform the 
required matrix manipulations; the program accommodates matrices of 
variable size up to the capacity of the available core storage, and 
the symbolic operation codes reduce the program specification to an 
almost trivial operation. 

Section V contains an analysis of citations and index term 
similarities obtained for a closed document collection for which 
citations from documents outside the collection are eliminated; the 
data are then compared with the citation similarities obtained for the 
corresponding open document collection. The similarity coefficients 
are found to be smaller for the open document collection, but the basic 
results are in general comparable for both collections. 





Section VI describes a clustering experiment in which documents 
were grouped in accordance with similarities in the index terms 
attached to the documents, and in the corresponding citation sets. 
The two types of groupings are compared and are found to give dissimilar 
clusters for the document collection under investigation. 
I. SOME HIERARCHICAL MODELS FOR AUTOMATIC DOCUMENT RETRIEVAL 

Gerard Salton 


1. Introduction 

Automatic systems designed to furnish references or specific data 
in answer to search requests are known as information retrieval systems. 
If such systems are to perform effectively, provision must be made for 
the execution of the following types of operation: 

(a) the analysis of information items and of search 
requests; 

(b) the generation of identifiers used to represent 
information content; 

(c) the normalization of information identifications to 
conform with some classification system; 

(d) the storage of information items and identifications 
so as to simplify access to related items; 

(e) the matching of information requests with information 
identifications, and the search for relevant items. 

In order to perform a satisfactory analysis of the information, 
it is generally necessary to identify the basic elements which are used 
to represent the information and to recognize the rules by which the 
basic elements can be combined into larger units. For example, if the 
items of information to be stored consist of chemical compounds, the 
basic elements are atoms and bonds, and these can in turn be combined 
into larger molecules in accordance with certain structural rules. 
If the items of information to be dealt with are documents or 
books, the basic elements are words in the natural language, and the 
larger units are sentences, paragraphs or chapters. The analysis of 
information consists in this case of the identification of document content. 
Unhappily, though it is relatively easy to isolate the individual words in 
a text, the interpretation of the meaning of the words is much more 
difficult. Furthermore, no well-defined set of rules is known by which the 
individual words in the language are combined into meaningful word groups 
or sentences. Specifically, the correct identification of the meaning of 
word groups depends at least in part on the proper recognition of syntactic 
and semantic ambiguities, on the correct interpretation of homographs, on 
the recognition of semantic equivalences, on the detection of word 
relations, and on a general awareness of the background and environment of a 
given utterance. 

Because of the many difficulties which arise in the semantic 
analysis of the natural language, certain simplifications are normally 
introduced before automatic analyses are attempted. These simplifications 
take the form of restricting the permitted area of discourse, or, 
alternatively, of limiting the types of linguistic structures permitted in the 
texts to be analyzed. Occasionally it is suggested that a special 
unambiguous language be used in preference to the natural language. 

In many cases these restrictions may not be realistic, since it 
may be difficult effectively to control the area of discourse, or the 
types of structures being used. Moreover, no simplified or artificial 
language is likely to prove generally acceptable. The present report is 
therefore principally concerned with document retrieval methods using the 
unrestricted natural language for the representation of information. All 
processes are based on methods which can be carried out automatically 
including, in particular, those dealing with the identification of 
document content. A complete semantic dictionary giving all contexts for 
each word in the language is not constructed in view of the many 
difficulties inherent in the required linguistic analysis. For similar reasons 
most of the relations between words and sentences in a text are not 
explicitly identified. Instead, standard word counting and syntactic 
analysis techniques are used in conjunction with a small number of table 
look-up operations. The construction of the required tables is further 
described in the following sections. 

2. Retrieval Structures 

Since the principal input to the retrieval system consists of 
texts in the natural language, the words used in a given document must 
of necessity constitute the basic units to be dealt with. It is also 
useful to identify the principal classes of relations between words; 
specifically two main relational classes are normally distinguished: the 
generic or inclusion relations, sometimes known as analytic relations, 
and the nongeneric or synthetic relations. The first class of relations 
is exemplified by the hierarchical subject arrangements provided in most 
library classification schedules. For example, the term "artery" may be 
included under "heart system," which may be included under "organs of the 
body," which may in turn be included under "physiology," and so on. The 
second class of relations may consist of a real dependence relation between 
two terms, such as the cause-effect relation between "poison" and "death," 
or it may be a formal relation, such as when two normally unrelated words 
are identified as equivalent in a given context, or are joined by a 
coordinating conjunction. 

In order to perform a reasonably effective content analysis, it is 
thus useful to take into account not only the presence of certain words in 
a text, but also the principal analytic and synthetic word relations. If 
this is to be accomplished without exhaustive semantic analysis of each 
word, it is necessary to avail oneself of a number of auxiliary aids. The 
following structures are useful: 

(a) a hierarchical arrangement between certain subject 
categories as provided by many library classification 
systems; 

(b) a set of cross references between related subjects such 
as those specified by many classified or alphabetized 
word indexes; 

(c) a set of related or synonymous words attached to each 
subject category to identify terms which may be used 
in similar contexts. 

The hierarchical subject arrangement identifies some of the 
generic relations, and the cross references and lists of related words 
specify the dependence and equivalence relations between terms. The cross 
references are analogous to the "see also" references available in many 
library classification systems, and the related word lists are similar to 
the "see" references. 
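The three auxiliary structures can be pictured as fields of a single node record. The following is a minimal sketch in Python, not the storage scheme of the report; the class name `Node`, its field names, and the helper functions are assumptions introduced for illustration, and the sample terms are taken from the physiology example used in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One subject category in the hierarchy (illustrative names only)."""
    name: str
    parent: "Node | None" = None                      # generic (inclusion) relation
    children: list = field(default_factory=list)      # generically inferior categories
    cross_refs: list = field(default_factory=list)    # "see also" references
    related_words: set = field(default_factory=set)   # "see" references / synonyms

def add_child(parent, name, related_words=()):
    """Attach a new subject category below an existing one."""
    node = Node(name, parent=parent, related_words=set(related_words))
    parent.children.append(node)
    return node

def ancestors(node):
    """Follow the inclusion relations upward to the top of the hierarchy."""
    chain = []
    while node.parent is not None:
        node = node.parent
        chain.append(node.name)
    return chain

physiology = Node("physiology")
organs = add_child(physiology, "organs of the body")
heart_system = add_child(organs, "heart system", {"heart", "artery", "vein"})
pathology = add_child(physiology, "pathology")
cardiac = add_child(pathology, "cardiac illness")
heart_system.cross_refs.append(cardiac)   # cross reference between related subjects

print(ancestors(heart_system))  # ['organs of the body', 'physiology']
```

Keeping the cross references as direct links to other nodes, rather than as names, lets a search follow them without a second look-up.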



(a) Abstract Hierarchical Structure with 
Cross References and Synonym Lists 

(b) Abstract Hierarchical Structure with 
Cross References and Criterion Trees 

Key: arrow, cross reference; circle, keyword or phrase with category 
indicator; branch between circles, generic (inclusion) relation; short 
parallel line segments, synonym list; small trees, criterion trees 

Hierarchical Structures 
Figure 1 


The complete arrangement is represented abstractly by the 
structure of Fig. 1(a). Each subject category is denoted by one or more nodes 
in the structure. Inclusion relations defined for certain subjects are 
represented by nonhorizontal branches between the corresponding nodes, the 
"included" subject appearing on a physically lower level on the page than 
the "including" subject. Cross references are represented by directed lines 
between specified nodes, and the related word lists by sets of short 
parallel line segments. 

Consider as an example the hierarchical structure shown in Fig. 2 
which includes certain terms from the field of physiology. One of the 
cross references connects the term "heart" considered as an organ to the 
term "cardiac illness" classified under pathology. It is clear from Fig. 2 
that a given term may appear in various places within the hierarchical 
arrangement. This is true in particular when a subject may be considered 
from a variety of viewpoints, in which case different generic relations 
and different related word lists normally apply. The various subject 
classifications provided by the Library of Congress (LC) classification 
system for the term "sexuality" are shown as an example in Fig. 3. Sample 
related word lists are included in Fig. 3 as are the LC subject indicators. 

A storage arrangement of the type shown in Fig. 1(a) lends itself 
to reasonably effective retrieval procedures. Indeed, document 
identifications and request identifications may easily be normalized before being 
used by substituting for each original identifier the subject indicator 
corresponding to the next higher node in the hierarchical arrangement. 
Documents can then be classified into subject categories, and document 
clusters of similar documents can be generated, based not on the words 
originally extracted from the documents or on the terms originally chosen 
by a variety of more or less reliable methods, but based on the subject 
categories corresponding to the nodes within the hierarchies. The 
normalization procedure will thus make it easier to match documents and requests 
Science 
  Physiology 
    I Organs 
      (A) Circulation 
          i. General 
          ii. Heart System 
              heart -> 
              artery 
              vein 
      (B) Reproduction <- 
          i. Structure 
              male 
              female 
          See also Embryology, Biology, Medicine 
    II Biology 
      (A) General 
      (B) Phenomena 
          i. Exchanges 
          -> ii. Reproduction 
              cellular 
              asexual 
    III Systems 
      (A) Muscular 
      (B) Nervous 
      (C) Blood -> 
          i. Constituents 
          ii. Blood Group 
    IV Pathology 
      (A) General 
      (B) Symptoms 
      (C) Illnesses 
          -> i. Blood 
          -> ii. Cardiac 
          See also Pharmacology, Chemistry, Toxicology 
    V Chemistry 
      (A) Physical Measures 
      (B) Organic 
          i. Alcohols 
          ii. Hormones -> 
          iii. Sexuality 
              See also Sociology, Psychology, Medicine, Anthropology, Literature 
    VI Endocrinology 
      (A) Glands 
      (B) Secretions 
          -> i. Hormones 

Figure 2 

Sexuality 

1. As a social science (H), including statistics (HB), family problems (HQ), 
   social pathology (HV), etc; 
   marriage, love, abortion, ethics, men, women, bed, birth, birth control, 
   contraception,... 

2. From the physiological viewpoint (Q), including sexual organs (QL,QM), 
   reproduction (QP,QH), etc; 
   sex organs, reproduction, instinct, puberty, anatomy, heredity,... 

3. Psychological and religious aspects (B), including reproduction (BF), 
   moral theology (BX), sex worship (BL); 
   reproduction, Oedipus complex, sex worship, emotion, passion, 
   excitement,... 

4. Medical and health aspects (R), including gynecology (RG), medical 
   practice (RC), public health (RA), etc; 
   instinct, operation, abortion, hygiene,... 

5. From the point of view of cultural differences among races (G), 
   including ethnology (GN) and folklore (GR); 
   marriage, folklore, incest,... 

6. From the point of view of development in the literature (P), including 
   literary history (PN), and English literature (PR); 
   magazine, Freud, sex crimes,... 

7. As an art (N), including erotica, etc. 

Sample Related Word Lists with 
Library of Congress Subject Indicators 

Figure 3 



for documents since the use of related terms will always refer back to the 
same subject identified either through the related word lists or by the 
cross-referencing process. 
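The normalization step amounts to a table look-up from related words to subject indicators. A minimal sketch in Python follows; the table `RELATED_WORDS` and the function names are hypothetical, and the members of the "blood" list are invented for the illustration.

```python
# Hypothetical related word lists: subject indicator -> related terms.
RELATED_WORDS = {
    "heart system": {"heart", "artery", "vein"},
    "blood": {"blood", "plasma", "hemoglobin"},
}

def build_index(related_words):
    """Invert the lists so that each term points to its subject indicator."""
    return {term: subject
            for subject, terms in related_words.items()
            for term in terms}

def normalize(identifiers, index):
    """Substitute the subject indicator for each known identifier.

    Identifiers that appear in no related word list are kept unchanged.
    """
    return sorted({index.get(term, term) for term in identifiers})

INDEX = build_index(RELATED_WORDS)
doc_norm = normalize(["artery", "vein", "plasma"], INDEX)
req_norm = normalize(["heart", "blood"], INDEX)
print(doc_norm)  # ['blood', 'heart system']
print(req_norm)  # ['blood', 'heart system']
```

The document and the request share no original word, yet after normalization their identifications coincide, which is the effect described in the text.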

The retrieval process will also benefit from various other possible 
transformations. Retrieval requests may, for example, be broadened by 
replacing the original terms by new terms which appear on a physically 
higher level in the hierarchy, and by following up the cross references. 
Contrariwise, requests may be narrowed or refined by including terms which 
appear on a physically lower level. The matching procedure may also be 
extended by considering as equivalent any term within the same list of 
related words, or any term within a given "distance" from some other 
term in the hierarchy. 
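Broadening, narrowing, and the notion of "distance" can be sketched with a simple parent table. The Python fragment below is illustrative only: the table `PARENT` and the function names are assumptions, and "distance" is taken here to mean the number of branches on the path joining two terms, a definition the text leaves open.

```python
# Hypothetical inclusion relations: term -> generically superior term.
PARENT = {
    "heart": "heart system",
    "artery": "heart system",
    "vein": "heart system",
    "heart system": "organs of the body",
    "organs of the body": "physiology",
}

def broaden(term):
    """Replace a term by the term on the next higher level, if any."""
    return PARENT.get(term, term)

def narrow(term):
    """List the terms on the next lower level."""
    return sorted(child for child, parent in PARENT.items() if parent == term)

def path_to_root(term):
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def distance(a, b):
    """Number of branches between a and b, or None if they are unconnected."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    for steps_up, node in enumerate(path_a):
        if node in path_b:
            return steps_up + path_b.index(node)
    return None

print(broaden("artery"))            # heart system
print(narrow("heart system"))       # ['artery', 'heart', 'vein']
print(distance("artery", "vein"))   # 2
```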

It is often useful to include as part of the normal retrieval 
structures not only individual disconnected words or terms, but also 
word pairs or triples, or indeed complete phrases and sentences. A 
document might, for example, be identified more closely by a set of phrases 
than by a set of individual words. Each list of related terms included 
in the hierarchy of Fig. 1(a) might therefore be supplemented or 
replaced by a "list of related phrases" which identifies the subject 
indicator of the corresponding node. It is convenient to represent the 
syntactic structure of each phrase in tree form, in such a way that each 
word is represented by a node of the tree and the syntactic dependencies 
by the branches of the tree.^1,2 The trees corresponding to the various 
phrases attached to a given node in the hierarchy are called criterion 
trees, and their addition to the system gives rise to the structure of 
Fig. 1(b). 

A sample set of related phrases is shown in Fig. 4. The examples 
of Fig. 4 refer to the subject "Philosophy of Education" and include 
phrases such as 

liberal [education] , 

[education] for *, 

[controversy] in [education], 

(Table listing the individual words of each criterion phrase against 
syntactic codes such as S, V, O, A, Q, PR and PO) 

Sample Criterion Trees for 
"Philosophy and Aims of Education" 

Figure 4 

and so on. The bracketed items represent special synonym classes or sets 
of related terms, and the asterisk stands for any arbitrary word. For the 
class [education] it is thus possible to substitute any one of a number of 
terms such as "education", "school", "university", "college", "academy", 
"lecturing", "explaining", and so on. 

As a result, each one of the criterion phrases of Fig. 4 represents 
a number of actual word strings related to one of the subject indicators 
in the hierarchy. The individual words of each phrase are listed in Fig. 4 
with a special syntactic code, and the set of codes corresponding to any 
given phrase can be used to represent the syntactic tree structure of the 
phrase. Furthermore the set of syntactic codes can be produced 
automatically as output of a syntactic analysis routine, and it is possible 
automatically to transform the tree structure into the code structure and 
vice versa.^ The terms "criterion phrases" and "criterion trees" can 
therefore be used interchangeably. 
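The way a single criterion phrase stands for many word strings can be illustrated with a flat matcher. The Python sketch below deliberately ignores the syntactic codes and compares word sequences only, so it is a simplification of the tree matching described in the report; the function name `matches` and the members of the [controversy] class are assumptions.

```python
# Synonym classes; [education] follows the text, [controversy] is hypothetical.
SYNONYM_CLASSES = {
    "[education]": {"education", "school", "university", "college",
                    "academy", "lecturing", "explaining"},
    "[controversy]": {"controversy", "dispute"},
}

def matches(pattern, words):
    """True if the word string fits the criterion phrase, item by item."""
    if len(pattern) != len(words):
        return False
    for item, word in zip(pattern, words):
        if item == "*":                       # asterisk: any arbitrary word
            continue
        if item in SYNONYM_CLASSES:           # bracketed item: any class member
            if word not in SYNONYM_CLASSES[item]:
                return False
        elif item != word:                    # plain item: the word itself
            return False
    return True

print(matches(["liberal", "[education]"], ["liberal", "college"]))           # True
print(matches(["[education]", "for", "*"], ["school", "for", "girls"]))      # True
print(matches(["[controversy]", "in", "[education]"],
              ["dispute", "in", "physics"]))                                 # False
```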

The criterion trees can be used in conjunction with syntactic 
analysis techniques for the classification of documents. It is necessary 
to obtain only the syntactic tree form of the sentences in a document, 
and to match these analyzed document excerpts with the criterion trees. 
Wherever a match occurs, the subject indicator attached to the 
corresponding node of the hierarchical structures is assigned to the given document. 
This will be further detailed in the next section. 
The techniques of language normalization and of request alterations 
which were previously mentioned in connection with the model of Fig. 1(a) 
can be used unchanged with the tree structures of Fig. 1(b). In addition 
to the generic and cross-reference substitutions previously described, the 
model of Fig. 1(b) also makes possible syntactic substitutions of various 
types before search requests are matched with document identifications. 
Various kinds of syntactic equivalences may, for example, be defined; 
phrases or words exhibiting specific syntactic indicators may be ignored 
in the matching process; and completely unspecified nodes may be admitted 
as part of the tree structures. 

In the next section, machine programs are outlined which make use 
of the retrieval structures of Fig. 1 for the classification of information, 
for the generation of document clusters, and for the matching of document 
identifications with search requests. 

3. Retrieval Procedures 

A. The Quantitative Model 

It is well known that word frequency counting procedures are being 
used extensively for the identification of document content and for a 
variety of other documentation tasks, including in particular, automatic 
indexing, automatic abstracting and the generation of word and document 
associations.^ The basic procedure consists in performing a frequency count of 
all the words in a document, rejecting certain high-frequency function words, 
such as prepositions and conjunctions, combining varying forms of words 
with similar stems, and using the remaining high-frequency words to 
represent the document content. Each document is thus identified by 
a specific set of high-frequency words $W_1, W_2, \ldots, W_n$. 
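The basic counting procedure can be sketched in a few lines of Python. The function word list, the crude suffix-stripping rule, and the function names below are all assumptions chosen for the illustration; a working system would use a fuller list and a proper stemming routine.

```python
from collections import Counter
import re

# A small, illustrative list of function words to be rejected.
FUNCTION_WORDS = {"the", "of", "and", "in", "a", "to", "is", "for",
                  "on", "by", "are", "with", "each"}

def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemming routine."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def content_words(text, top=3):
    """Return the `top` most frequent stems after rejecting function words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(stem(w) for w in words if w not in FUNCTION_WORDS)
    return [word for word, _ in counts.most_common(top)]

text = ("Documents are retrieved by matching documents with requests. "
        "A request is matched to each document.")
print(content_words(text, top=2))  # ['document', 'request']
```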

It is often desirable to use word groups instead of individual words 
for the identification of document content. Such word groups are 
generated by identifying those sets of words which tend to occur jointly 
in similar contexts. More simply, the assumption is made that if two 
high-frequency content-words co-occur in several sentences of a text, 
they are related in some sense, and may therefore be grouped. 

A typical method for the generation of word groups is: 

(a) construct a word-sentence incidence matrix C which 
lists content words against sentences; matrix element 
$C_{ij}$ is defined to be equal to n if and only if sentence 
j contains word i exactly n times; 

(b) define a coefficient of similarity between words based 
on frequency of co-occurrence between pairs of words 
in the sentences; 

(c) generate a word-word similarity matrix R which exhibits 
all similarity coefficients between pairs of content 
words; 

(d) define word groups corresponding to those word pairs 
whose associated similarity coefficient is greater 
than some stated threshold value. 
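Step (a) can be sketched as follows in Python. The function name and the sample sentences are assumptions for illustration; the sentences are taken to be already reduced to content words separated by blanks.

```python
def incidence_matrix(words, sentences):
    """C[i][j] = number of times words[i] occurs in sentences[j]."""
    return [[sentence.split().count(word) for sentence in sentences]
            for word in words]

words = ["heart", "blood", "hormone"]
sentences = [
    "heart blood heart",
    "blood hormone",
    "heart blood",
]
C = incidence_matrix(words, sentences)
print(C)  # [[2, 0, 1], [1, 1, 1], [0, 1, 0]]
```

Each row of C describes one word across all sentences, so two words that tend to occur in the same sentences have similar rows.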

A typical word-sentence incidence matrix C is shown in Fig. 5(a). 
To obtain a coefficient of similarity between two words based on the 
frequency of co-occurrence in the various sentences of a document, it is 
only necessary to perform a pairwise comparison of the corresponding rows 
of C. A useful coefficient of similarity between rows of a numeric matrix 
is the cosine of the angle between the corresponding m-dimensional vectors. 
The similarity coefficients can be displayed in an n x n symmetric word 
similarity matrix R, where the coefficient of similarity $R_{ij}$ between word 
$W_i$ and word $W_j$ is 

$$R_{ij} = \frac{\sum_{k=1}^{m} C_{ik}\, C_{jk}}{\sqrt{\sum_{k=1}^{m} C_{ik}^{2} \; \sum_{k=1}^{m} C_{jk}^{2}}}$$ 




A typical word-similarity matrix R, corresponding to the 
word-sentence matrix C, is shown in Fig. 5(b). Since R is symmetric, it is 
necessary to scan only the right (or left) triangular part in order to 
detect word pairs with large similarity coefficients. The word grouping 
procedure may be refined by various methods, including in particular: 

(a) the use of a normalizing procedure which deletes word 
suffixes and combines the various forms of words with 
identical stems; 

(b) the use of a dictionary of synonyms or thesaurus, which 
would permit the replacement of each high-frequency word 
by the corresponding thesaurus head; 

(c) the generation of complete phrases, instead of only word 
pairs, by extracting from the text the related word pairs 
together with their context. 
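The computation of the similarity matrix R and the detection of word pairs can be sketched as follows in Python, using the cosine coefficient between rows of the incidence matrix. The function names, the sample matrix, and the threshold 0.7 are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(C):
    """R[i][j] = cosine between rows i and j of C."""
    return [[cosine(row_i, row_j) for row_j in C] for row_i in C]

def word_pairs(R, threshold):
    """Scan one triangular part only, since R is symmetric."""
    n = len(R)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if R[i][j] > threshold]

C = [[2, 0, 1],    # word 0 in sentences 0..2
     [1, 1, 1],    # word 1
     [0, 1, 0]]    # word 2
R = similarity_matrix(C)
pairs = word_pairs(R, 0.7)
print(pairs)  # [(0, 1)]
```

Here words 0 and 1 have coefficient 3/sqrt(15), about 0.77, and form the only pair above the threshold.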
(a) Typical Word-Sentence Incidence Matrix C 
($C_{ij} = n$: sentence $S_j$ contains word $W_i$ exactly n times) 

(b) Typical Word-Word Similarity Matrix R 

Matrices Used for Word Grouping 
Figure 5 
If it is desired to generate document associations or document 
clusters instead of word associations, the same procedures can be used with 
some slight modifications. Instead of starting with a word-sentence matrix 
C, as shown in Fig. 5(a), it is now convenient to construct a 
word-document matrix F, listing frequency of occurrence of word $W_j$ in document $D_i$. 
Specifically, $F_{ij} = n$ implies that document $D_i$ contains word $W_j$ exactly n 
times. 

Document similarities can now be computed as before by comparing 
pairs of rows, and obtaining similarity coefficients based on the frequency 
of co-occurrence of the content words included in the given documents.^ 
This procedure generates a document-document similarity matrix which can 
in turn be used for the generation of document clusters, for example, by 
defining a cluster as including all those documents whose similarity 
coefficients with all other documents in the same cluster exceed a given 
threshold value. 
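The cluster definition just given can be sketched with a simple greedy pass in Python. This is one possible reading, not the procedure of the report: the function name and the sample similarity matrix are assumptions, and the greedy assignment depends on document order and does not enumerate every group satisfying the definition.

```python
def clusters(S, threshold):
    """Greedy grouping over a document-document similarity matrix S.

    A document joins the first cluster in which its similarity with
    every current member exceeds the threshold; otherwise it starts
    a new cluster.
    """
    groups = []
    for doc in range(len(S)):
        for group in groups:
            if all(S[doc][member] > threshold for member in group):
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups

# Hypothetical similarity coefficients for four documents.
S = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
groups = clusters(S, 0.5)
print(groups)  # [[0, 1, 2], [3]]
```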

Consider now the problem of document retrieval in the present 
context. Specifically, it may be desired to identify those documents whose 
lists of high frequency content words exhibit similarities with the terms 
used in a given search request. The preceding model can then be used 
unchanged by adding to the m x n document-word matrix F a special row $F^{m+1}$ 
which includes the terms used in specifying the search request. 
Specifically, $F^{m+1}_j$ is set equal to w if term $W_j$ is used in the search request with 
weight w; if word $W_j$ is not used in the given search request, $F^{m+1}_j$ is set 
equal to 0. If no weights are specified by the requestor, the values of 
the elements of row $F^{m+1}$ are of course restricted to 0 and 1. 



A typical document-word matrix F of dimension (m+1) x n is shown 
in Fig. 6. To obtain the set of all "relevant" documents, it is only 
necessary to compute the similarity coefficients between row $F^{m+1}$ on 
the one hand, and each of the other document rows $F^1, F^2, \ldots, F^m$ on the 
other. The procedure previously outlined can be followed, in that the 
set of relevant documents can be defined to include all those documents 
whose similarity coefficients with the search request exceed a given 
threshold value. The terms used in the search request can also be 
normalized by the thesaurus look-up method described previously for the 
high-frequency text excerpts. 
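Retrieval then reduces to comparing the request row with each document row. A minimal Python sketch follows; the function names, the sample matrix, the unweighted request, and the threshold 0.4 are assumptions for illustration, and the request row is passed separately rather than appended to F.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(F, request, threshold):
    """Return (document index, coefficient) pairs above the threshold,
    in decreasing order of similarity with the request row."""
    scored = [(i, cosine(row, request)) for i, row in enumerate(F)]
    relevant = [(i, s) for i, s in scored if s > threshold]
    return sorted(relevant, key=lambda pair: -pair[1])

# Rows: documents; columns: content words.
F = [
    [3, 1, 0, 0],
    [0, 2, 2, 0],
    [0, 0, 1, 4],
]
request = [1, 1, 0, 0]    # unweighted request using the first two words
hits = retrieve(F, request, 0.4)
print([doc for doc, _ in hits])  # [0, 1]
```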



Document-Term Matrix F including Extra Row 
Specifying Search Request 


Figure 6 
The complete procedure is summarized by following the simple arrows 
between the boxes of Fig. 7. The processing specified for the search 
requests is seen to be parallel to that used for the documents themselves. 
Furthermore, the machine program which identifies word clusters by row 
correlation of the word-sentence incidence matrix, can be used unchanged 
for the identification of document clusters, and also for the 
identification of relevant documents by substituting the document-word matrix for 
the word-sentence matrix. 

B. The Hierarchical Model with Synonym Lists 

The quantitative model is based primarily on the words which occur 
in the individual documents: documents are identified by sets of 
high-frequency words; they are grouped or clustered by using similarities 
between these sets of high-frequency words; and they are retrieved by 
comparing words extracted from search requests with high-frequency words 
extracted from documents. 

In contrast, now consider a system which includes semantic 
associations as represented by the structure of Fig. 1(a). Each node in the 
hierarchical structure represents a subject category, and a given word 
occurring in a text may therefore be associated directly with one or more 
nodes in the hierarchy if it identifies such a subject category. More 
commonly, a word occurring in a text will be associated with one or more 
of the related word lists which are attached to the nodes of the hierarchy. 





Simplified Procedures for the Generation of Document Identifications, 
Document Clusters and Relevant Documents 




Figure 7 
Given a high-frequency word occurring in a text, the structure of 
Fig. 1(a) can be used to identify the following types of semantic 
associations: 

(a) the subject category or categories with which the 
given word is immediately associated; 

(b) the subject category or categories which are 
generically superior, that is, located on a level 
immediately above, and linked to the nodes identified 
in part (a); 

(c) the subject category or categories which are 
generically inferior, that is, located on a level 
immediately below, and linked to the nodes identified 
under (a); 

(d) the first-order cross references, that is, all subject 
categories directly cross-referenced by the nodes 
identified in part (a); 

(e) the set of all related words or phrases associated 
with the nodes identified under (a). 
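The five types of association can be read off a stored hierarchy by a single look-up. The Python sketch below is illustrative: the dictionary layout, the key names, and the function `associations` are assumptions, and the two sample nodes are patterned on the physiology example.

```python
# Hypothetical stored hierarchy: category -> its links and related words.
HIERARCHY = {
    "heart system": {
        "parent": "circulation",
        "children": [],
        "cross_refs": ["cardiac illness"],
        "related": ["heart", "artery", "vein"],
    },
    "circulation": {
        "parent": "organs",
        "children": ["heart system"],
        "cross_refs": [],
        "related": [],
    },
}

def associations(word):
    """Collect associations (a) through (e) for a high-frequency word."""
    found = []
    for name, node in HIERARCHY.items():
        if word == name or word in node["related"]:
            found.append({
                "category": name,                  # (a) immediate category
                "superior": node["parent"],        # (b) level immediately above
                "inferior": node["children"],      # (c) level immediately below
                "cross_refs": node["cross_refs"],  # (d) first-order cross references
                "related": node["related"],        # (e) related words or phrases
            })
    return found

result = associations("artery")
print(result[0]["category"], "<", result[0]["superior"])  # heart system < circulation
```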

It is clear that the procedures for document identification, 
clustering, and retrieval can be improved considerably over those which are 
possible with a strictly quantitative model. Consider first the document 
identification: the word frequency lists identifying the various 
documents can be replaced with subject category lists by a look-up procedure 
in the hierarchical structure, and the corresponding subject category 
indicators can be arranged in frequency order to identify the documents. 
The procedure is described by means of an example. 
Figure 8 contains a list of high-frequency words and phrases for 
a sample document. When each of the high-frequency words is looked up in 
a hierarchical structure of the type shown in Fig. 1(a), and the 
corresponding category list is produced as shown in Fig. 9, it may be noticed 
that a total of 15 different subject categories are identified. These 
15 categories, in turn, lead to 7 main subject headings by substitution 
of generically superior terms from the hierarchy: sociology, medicine, 
science, military science, technology, philosophy, and agriculture. The 
subject categories and the corresponding codes shown in Fig. 9 are taken 
from the Library of Congress classification schedules. 

Heart Attack 
Treatment 
Cholesterol Level 
Blood 
Young Men 
Suggest 
Female Hormones 
Female Sex Hormones 
Sex Hormones 
Effect of Sex Hormones 
Reticulo-Endothelial System (RES) 
Stimulation of RES 
Animal 
Medical Chemistry 
Physics 

High Frequency Words or Phrases in Sample Document 
Figure 8 

^ See H. P. Luhn, "Auto-Encoding of Documents for Information Retrieval 
Systems," Modern Trends in Documentation, M. Boaz (editor), Pergamon 
Press, 1959. 
H (Sociology) 
    HV  Young Men (social pathology) 
    HQ  Youth (family, marriage, women) 
    HF  Suggestion Systems (commerce) 
    HQ  Female 
    HQ  Men 

R (Medicine) 
    RC  Heart Attack (medical practice) 
    RC  Blood Diseases (medical practice) 
    RC  Sex Hormones (endocrinology) 
    RS  Medical Chemistry (pharmacy) 
    RB  Blood Chemistry (pathology) 
    RM  Stimulants (therapeutics) 
    R   Laboratory Animal (research) 

Q (Science) 
    QP  Cholesterol (physiological chemistry) 
    QP  Blood (physiology) 
    QP  Blood Chemistry (physiology) 
    QP  Sex Hormones (physiology) 
    QP  Laboratory Animal (comp. physiology) 
    QP  Cholesterol (chemistry) 
    QH  Physical Research (physics) 
    QP  Reticulo-Endothelial System (physiology) 
    QL  Laboratory Animal (zoology) 

U (Military Science) 
    UG  Attack (military engineering) 

T (Technology) 
    TA  Leveling (engineering) 

B (Philosophy, Religion) 
    BJ  Young Men (ethics) 

S (Agriculture) 
    SF  Laboratory Animal (animal culture) 

Subject Categories Obtained for Sample Document 
Figure 9 


The subject heading "military engineering," for example, is 
obtained by cross reference from the high-frequency word "attack"; 
similarly, "agriculture" is obtained by cross reference from "laboratory 
animal." Four of the seven major subject headings can be eliminated 
because they are identified with insufficient frequency. The remaining 
major subject headings are "science," "medicine," and "sociology," 
respectively, and the principal subject categories in decreasing frequency order 
are 

physiology, 
medical practice, 
family and marriage, 

and so on. 

The list of subject categories produced by the look-up procedures
in the hierarchical structure can be used not only for purposes
of document identification, but also for document clustering and retrieval.
Specifically, lists of high-frequency words or phrases together with the
relevant subject categories are obtained as before from the hierarchical
structure for a given document collection. These phrases are then listed
in a document-phrase matrix G, similar in nature to the document-term
matrix F of Fig. 6. However, whereas each applicable term is listed only
once and is thus represented by a single matrix column in F, all semantic
contexts are provided for each phrase in G by listing the phrases as many
times as there are applicable subject categories. A given phrase may
therefore be represented in G by several matrix columns. Thus if matrix
element $G^{i}_{jk} = n$, it is implied that a given phrase occurs n times
in document $D^{i}$; the first subscript attached to the phrase refers to
the phrase number, and the second to the subject category of the phrase.

Figure 10 shows a typical document-phrase matrix G in which the phrases 
have been ordered in such a way that all items belonging to one and the 
same subject category appear in adjacent matrix columns. 
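
The construction of G described above can be sketched as follows. This is a minimal illustration and not the program used in the report: the phrase-to-category table `CATEGORIES`, the sample counts, and the function name `build_phrase_matrix` are hypothetical.

```python
from collections import Counter

# Hypothetical look-up result: each phrase maps to every subject category
# under which the hierarchical structure lists it.
CATEGORIES = {
    "blood": ["QP", "RC"],
    "sex hormones": ["QP", "RC"],
    "young men": ["BJ", "HV"],
}


def build_phrase_matrix(doc_counts, categories):
    """Return (columns, rows) of a document-phrase matrix.

    A phrase contributes one column per applicable subject category, and
    columns are sorted so that one category occupies adjacent columns.
    """
    columns = sorted(
        (cat, phrase) for phrase, cats in categories.items() for cat in cats
    )
    rows = [[counts.get(phrase, 0) for _cat, phrase in columns]
            for counts in doc_counts]
    return columns, rows


docs = [
    Counter({"blood": 3, "sex hormones": 2}),
    Counter({"young men": 1, "blood": 1}),
]
columns, G = build_phrase_matrix(docs, CATEGORIES)
```

Each row of `G` is one document; a phrase such as "blood" appears in two columns, once under each of its categories.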




Document-Phrase Matrix with Specified Subject Categories 

Figure 10 

Document clusters can now be generated by using the procedures
previously described for the quantitative model, except that matrix G is
substituted for F. Specifically, a document-document similarity matrix is
obtained by pairwise comparison of rows of G, and clusters of documents are
defined as a function of the document similarity coefficients as before.

The procedure for document retrieval is also parallel to that previously
described in that an extra row specifying the search request is added to
the document-phrase matrix G. Row $G^{m+1}$ is then compared against all
other document rows to obtain an n-dimensional vector of similarity
coefficients, and documents with sufficiently large coefficients are
considered to be relevant to the given request.

The request vector $G^{m+1}$ is constructed by substituting for the
words used in a given search request the category indicators and first-order
cross references found in the hierarchical structure, and by labeling the
appropriate matrix elements in row $G^{m+1}$. The value of the individual
elements may be either 0 or 1, or alternatively a weight w may be assigned
as previously explained. The row correlation program which is used to
identify the relevant documents, as well as the word and document clusters,
is the same as that used for the quantitative model.
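
The comparison of a request row with the document rows can be sketched as below. The similarity coefficient is defined in an earlier part of the report; the cosine measure used here is a stand-in assumption, and the names `cosine`, `retrieve`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two rows; 0.0 if either row is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(G, request, threshold):
    """Compare the request row with every document row of G.

    Returns (document index, coefficient) pairs whose coefficient reaches
    the threshold, best match first.
    """
    scored = [(i, cosine(row, request)) for i, row in enumerate(G)]
    hits = [(i, s) for i, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


G = [
    [3, 2, 0, 0],
    [0, 1, 4, 0],
    [0, 0, 0, 5],
]
request = [1, 1, 0, 0]  # binary elements; a weight w could be used instead
hits = retrieve(G, request, threshold=0.5)
```

The same pairwise comparison applied to every pair of document rows would give the document-document similarity matrix used for clustering.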

The complete procedure is again summarized in the chart of Fig. 7
where the double arrows are to be substituted for the simple arrows
wherever an alternate path is provided.

C. The Hierarchical Model with Tree Structure 

The preceding model, even though providing for the addition of 
semantic associations by introducing new words related analytically or 
generically to the ones originally used in search requests and documents, 
is still primarily based on the word as a unit of information. In par¬ 
ticular, two documents are assumed to be related, if the words originally 
contained in them, or their respective substitutes, can be matched. The 
same is true for the matching of requests with documents. Phrases, rather 
than words, can of course be used, as illustrated by the document-phrase 
matrix of Fig. 10. However, in the hierarchical look-up process used up 
to now it is assumed that two phrases can be matched if they contain one, 
or two,... or n words which are identical (except for the permitted 
semantic substitutions). The syntactic structure of the phrase itself has
not so far been taken into account; this makes it possible to match
distinct concepts such as, for example, "school children," "children in
school," and "school for children."

The internal structure of the phrases can be utilized, and a more
accurate matching process can be obtained by providing syntactically ana¬ 
lyzed phrases and requiring matches of the individual words as well as of 
their syntactic structure. This is accomplished by using the model of 
Fig. 1(b) in which the sets of related phrases or criterion trees are 
listed in syntactic tree form. The principal modifications introduced in 
the over-all process by the structural analysis are shown in Fig. 7 by 
means of boxes interconnected by triple arrows. 

The process starts as before by a word frequency count and a nor¬ 
malization procedure designed to combine the varying forms of words with 
similar stems. Excerpts containing a large number of high-frequency words 
are then extracted from the original documents. Instead of extracting 
only individual words, it is convenient to use larger units, such as 
clauses or complete sentences. A syntactic analysis is then performed
of both the search requests and of the extracted parts of the documents,
and the analyzed output is represented in tree form (see references 1, 2,
and 3). Instead of physically manipulating the two-dimensional tree form,
it is convenient to attach to each word of the text a syntactic code which
exhibits the syntactic function of the given word and also the syntactic
dependency structure. The principal words of the sentence, such as main
verb, subject, object, and so on, are assigned short strings of code
characters, and the remaining words which are syntactically dependent on
them are assigned longer strings. Thus, the length of the character string
attached to a given word signifies "depth" of syntactic dependence and
therefore depth of location in the corresponding tree structure (see
reference 3). A sample excerpt is shown in column 1 of Fig. 11 and the
corresponding syntactic character strings are given in column 3 of Fig. 11.


      1                 2                  3               4               5
Excerpt from         Subject        Coded Syntactic     Matching        Matching
Original Text        Category       Tree Structure      Code String     Criterion Tree

Many                                1S
of                                  1SPR
the                                 1SPOA
new                                 1SPOA
citizens             [person]       1SPO
began                               1V
to                                  1VPR
perceive             [consider]     1VPV
the                                 1VPOA
schools              [education]    1VPO
as                                  1VPVPR
a                                   1VPVPOA
means                [means]        1VPVPO
by                                  1VPVPO7VPR
which                               1VPVPO7VPO
their                               1VPVPO7SA
children             [person]       1VPVPO7S
might                               1VPVPO7VX
have                                1VPVPO7V
a                                   1VPVPO7OA
fuller               [desirability] 1VPVPO7OA
life                 [world]        1VPVPO7O
.                                   1.
Hence                               1VD
,                                   1,
in                                  1VPR
addition                            1VPO
to                                  1VPOPR
educating            [education]    1VPOPOG
for                                 1VPOPOGPR
intelligent                         1VPOPOGPOA
citizenship          [world]        1VPOPOGPO

(Entries of columns 4 and 5 are not legible in the source.)

Sample Tree Matching Output

Figure 11

Following the syntactic analysis, the words and associated code
strings extracted from documents and search requests are compared with the
criterion trees included in the hierarchical structure, and the
semantically associated information, including cross references and
generically related items, is extracted from the hierarchy whenever a match
is found. The operation required to find matches between word strings on
the one hand, and criterion strings on the other, is however not as simple
as the word matching procedure used before. Indeed, only a small number of
general criterion trees are provided, each of which may be matched with
a multiplicity of actual word strings. This is achieved by associating
general terms, such as subject categories, with the nodes of the criterion
trees, and by leaving the syntactic structure of the criterion trees
largely unspecified. "Variable" characters are therefore introduced in
the syntactic code strings associated with the criterion trees, and each
"variable" character can replace a specified set of "fixed" syntactic
characters of the type shown in Fig. 11. The dashes used in the code
strings of the criterion trees shown in Fig. 4 are examples of such
variable characters.

In order to find a match between a given word string extracted from 
a document and its associated sequence of fixed syntactic codes, and a 
given criterion string and its associated sequence of fixed or variable 
code characters, the following operations are therefore necessary.

    Each word in the word string must be looked up in the
    hierarchical structure, and replaced by the corresponding
    subject heading (this operation is equivalent to the
    thesaurus look-up described in connection with the
    quantitative model).

    Each variable code character associated with each term
    in the criterion tree must be replaced by a sequence
    of fixed characters.

A match is then obtained between a given word of a word string and a
given term of a criterion string if the associated subject headings as
well as the associated fixed character strings are identical. Furthermore,
a complete criterion tree will match a complete sentence extracted
from a text if there exists a sequence of fixed characters such that
each term in the criterion tree matches at least one word in the
sentence in the same linear order.

Given any sentence, it is necessary to test in order all the 
criterion trees which contain any of the relevant subject categories. 
Various strategies are possible for matching sentences (word strings) 
and criterion trees. The most immediate method consists in taking the 
first term in the first criterion tree and comparing it in turn with 
each subject category associated with the words in the sentence until a 
match is found. The next term is then taken and matched against all the 
subsequent words in the same sentence, and so on, until either the terms 
in the criterion tree or the subject categories associated with the
sentence are exhausted. If the sentence is exhausted before the criterion
tree, no match can possibly exist, and the next criterion tree must be
processed. If the criterion tree is exhausted before the sentence, a match
may or may not exist. The procedure will in general permit the matching of
long sentences with short criterion trees, since many of the words in a
sentence are disregarded in the process. Thus, it is always possible to
disregard subordinate clauses, adverbial or prepositional phrases, or any
other sentence parts which may not be included in any given criterion tree.*

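
The "most immediate method" of comparing one criterion tree with one sentence can be sketched as follows. This is a simplified illustration under stated assumptions: only subject headings are compared (the syntactic code strings are left out), and the function name `match_in_order` and the sample data are hypothetical.

```python
def match_in_order(criterion_terms, sentence_categories):
    """Match each criterion term to a later word of the sentence, in order.

    sentence_categories holds one subject category per word (None for words
    without one). Returns the matched word positions, or None when the
    sentence is exhausted before the criterion tree.
    """
    positions = []
    start = 0
    for term in criterion_terms:
        for i in range(start, len(sentence_categories)):
            if sentence_categories[i] == term:
                positions.append(i)
                start = i + 1  # later terms may only use subsequent words
                break
        else:
            return None  # sentence exhausted first: no match can exist
    return positions


sentence = ["person", None, "consider", "education", None, "means"]
found = match_in_order(["person", "education"], sentence)
```

Words that carry no matching category are simply skipped, which is why a short criterion tree can match a long sentence.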
The strategy used to assign sequences of fixed code characters to 
the variable characters associated with the criterion trees is similar to 
the previously described method for matching word and criterion strings. 

The variable characters are processed from left to right and are initially 
assigned the shortest possible fixed character strings. This process is 
continued until either the given character string is exhausted or it is 
determined that the complete assignment of fixed character strings to all 
variable characters is impossible. In the latter case a new assignment 
is tried by using successively longer fixed strings to replace the variable 
ones. As an example, the fixed character string "1SPOA" matches the string
"-S*A", by replacing the variable characters "-" and "*" respectively by
the fixed characters "1" and "PO".
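
The shortest-first assignment of fixed strings to variable characters can be sketched as a small backtracking routine. The report states that each variable character may replace a specified set of fixed characters, but that set is not given in this section; the sketch therefore assumes that a variable may stand for any non-empty run of fixed characters. The names `VARIABLES` and `assign` are hypothetical.

```python
VARIABLES = {"-", "*"}  # characters treated as variable in a criterion code


def assign(pattern, fixed):
    """Return the fixed substrings assigned to the variables, or None.

    Variables are processed left to right and tried with the shortest
    candidate first; a longer candidate is tried only after the shorter
    one fails to lead to a complete assignment.
    """
    if not pattern:
        return [] if not fixed else None
    head, rest = pattern[0], pattern[1:]
    if head not in VARIABLES:
        if fixed[:1] != head:
            return None
        return assign(rest, fixed[1:])
    for length in range(1, len(fixed) + 1):
        tail = assign(rest, fixed[length:])
        if tail is not None:
            return [fixed[:length]] + tail
    return None


result = assign("-S*A", "1SPOA")
```

For the example in the text, the first variable receives "1" and the second receives "PO" after the one-character candidate "P" has failed.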

Two matches are found when the word string illustrated in Fig. 11
is compared with the criterion trees of Fig. 4. The subject headings
which replace some of the words in the sentence are shown in column 2, and
the matching code strings and criterion trees are shown in columns 4 and
5 of Fig. 11 respectively.

*A different matching procedure which uses the graph theoretic properties
of the syntactic trees and finds the subgraphs of a given graph instead
of proceeding on a node-by-node basis has been programmed by E. H.
Sussenguth (see reference 7).

Following the look-up procedure in the hierarchical structure and 
the comparison of sentences with criterion trees, the item-sentence and 
item-document incidence matrices are constructed as before. Each item is 
now a criterion tree instead of a single word, and a given sentence or 
document is characterized by the matching criterion trees contained in it. 
The remainder of the procedure outlined in Fig. 7 is also followed un¬ 
changed in that the item-item similarity matrix is used to generate item 
clusters; the document-document correlation matrix similarly generates 
document clusters, and the item-document matrix augmented by the request 
vector determines documents which are relevant to a given search request. 

The complete process can of course be kept more or less flexible 
by assigning completely variable code strings to the terms of the cri¬ 
terion trees, or alternatively by not permitting any variable code 
characters at all. Similarly the criteria used to match word strings 
with criterion strings may be made more or less restrictive. 

4. Summary

Three models have been described for automatic document retrieval 
systems. The first one is based largely on the words used in the docu¬ 
ments and search requests. The second adds semantic associations by 
introducing a hierarchical structure consisting of generic inclusion rela¬ 
tions, cross references, and lists of related words. Finally, the last 
model also exhibits syntactic associations through the introduction of 
syntactic tree structures.

The programs needed for document clustering and retrieval become
increasingly complex as semantic associations and syntactic structure are 
taken into account. The following programs are specifically required: 

(a) itemization of linear text; 

(b) normalization of word forms by suffix cutoff and
condensation of words with identical stems;

(c) word frequency count;

(d) extraction of words, phrases, or sentences including
high-frequency words;

(e) thesaurus look-up process to replace a given item by
the corresponding subject heading;

(f) construction of item-sentence and item-document
incidence matrices;

(g) row-correlation process to obtain word and document 
clusters, and to compute a relevance index for documents. 
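
Steps (a) through (c) of this list can be sketched in a few lines. The suffix list, the minimum stem length, and the function names below are illustrative assumptions and are not taken from the report; no exclusion of common words is attempted.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "ed", "s")  # illustrative only, tried in this order


def stem(word):
    """Cut off the first listed suffix that leaves a stem of 3+ letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def high_frequency_stems(text, minimum):
    """Itemize text, normalize word forms, and keep the frequent stems."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(stem(w) for w in words)
    return {s: n for s, n in counts.items() if n >= minimum}


sample = "Hormones affect blood. A hormone level in blood affects treatment."
frequent = high_frequency_stems(sample, minimum=2)
```

Here "Hormones" and "hormone" are condensed into one item, as are "affect" and "affects", before the frequency count is taken.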

The hierarchical models with synonym lists and criterion trees are 
based in addition on the following programs: 

(h) a hierarchical look-up process designed to identify generi- 
cally superior or inferior items, cross-referenced items, 
and lists of related items; 

(i) a syntactic analysis procedure which furnishes the syn¬ 
tactic tree structure of each sentence; 

(j) a tree matching process which compares a fixed tree 
structure representing a word string with a variable 
structure representing the criterion trees. 

A useful comparison of the three models is difficult to perform, 
since the usual criteria of efficiency including time of execution and 
internal computer storage requirements are obviously not sufficient by 
themselves. Some measure of goodness or accuracy of retrieval is also 
needed. If speed and storage requirements alone were of importance, the
quantitative model would clearly be most efficient. Comparative tests
with actual document collections are needed to determine whether the more
complicated hierarchical look-up and syntactic tree matching processes
provide a sufficient improvement in retrieval accuracy to warrant the
cost in both time and equipment required by the hierarchical models.

II. AUTOMATIC REFERENCE ANALYSIS
Margaret Thompson

1. Introduction

With the development of versatile character recognition equipment,
it becomes increasingly possible to use as basic machine input
unedited running text as it appears in books, magazines, or journals.
Bibliographic references or citations constitute an integral part of
most technical articles, and methods must be available for processing
such bibliographic citations automatically. In particular, it becomes
necessary to identify similar, or equivalent citations, and to rearrange
a reference list into some specified order. The difficulties arising in
this connection are largely due to the extreme variability in format and
to the lack of standardisation which prevails in the publication of
citations.

The present program takes bibliographic citations and automatically
arranges them into a standard format, in such a way that the various parts
of the citation are unambiguously identified. These standardised citations
can later be processed by sorting and matching procedures to identify
similar citations and to effect various rearrangements.

2. General Categories of a Bibliographic Reference 


All references can be manually separated into nine or fewer general
categories. The present program handles this separation automatically with
a minimum of manual preparation of the input data. The reference
categories to be recognised are as follows:

(a) author(s), including initial(s), Jr., III, etc.,

(b) title of paper or book,

(c) title of journal,

(d) volume number,

(e) issue number,

(f) page number,

(g) name of publisher,

(h) city of publisher,

(i) year of publication.

All given input information is classified under one of these
headings; however, one or more of these categories may be missing from
a legitimate reference.
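
One way to picture the nine categories is as a record with nine slots, any of which may stay empty. The sketch below is only an illustration; the class name `Citation` and its field names are hypothetical and do not describe the tables used by the program.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Citation:
    """One reference split into the nine categories; empty means missing."""

    authors: List[str] = field(default_factory=list)
    title: str = ""
    journal: str = ""
    volume: str = ""
    issue: str = ""
    page: str = ""
    publisher: str = ""
    city: str = ""
    year: str = ""

    def missing(self):
        """Names of the categories for which nothing was entered."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


book = Citation(
    authors=["Aitken, A. C."],
    title="Determinants and Matrices",
    publisher="Oliver and Boyd",
    city="Edinburgh",
    year="1948",
)
```

A book reference of this kind legitimately leaves the journal, volume, issue, and page categories empty.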

3. Automatic Category Selection

The first part of the program reads in the complete data from
a single reference. Fields are pieces of data information ending with
some punctuation mark. Categories contain one or more fields ending
with some key punctuation mark. Categories are identified and formed
in the order described in Part 2. From the complete reference source
consecutive fields are examined to determine if the data fit into the
description of the category being formed. It may happen that a part of
a category is read and further fields are necessary to complete the
category.

In the author category, certain defined formats are likely to
begin the citation. The most common formats are tested first, such as
one or two initials preceding the author's last name, or the author's
last name followed by one or two initials. The more subtle, less
frequently occurring formats are then tested; in particular, in a series
of last names of authors, the word "AND" is searched for immediately
before the final author.

A typical author category formation might be: the first field
contains an initial, "I."; the second field contains the author's last
name, "NAME"; the third field contains the title of a paper,
"Computers ... Constants." Fields one and two are collected, put in
a standardised format, and entered in the Author Table as "NAME, I."
Field three did not contain "Jr." or another author. Therefore this
is the first data field considered in formulating the next category.
If more data are necessary for the second category, additional fields
are read from the complete reference source. This procedure continues
until all nine categories are formed or it is determined that one or
more of the categories are vacuous. If a category identifier is found
before the testing occurs, the pertinent data are identified and saved,
until that particular category is being formed.
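
The two most common author formats can be tested as sketched below. This is a hedged illustration, not the program's logic: the regular expression `INITIALS` and the function `normalise_author` are hypothetical, and only a pair of fields is examined.

```python
import re

INITIALS = re.compile(r"^(?:[A-Z]\.\s*)+$")  # e.g. "I." or "P. A. M."


def normalise_author(fields):
    """Combine two fields into 'LAST, INITIALS', or return None.

    Accepts initials first ("I.", "NAME") or last name first ("NAME", "I.").
    """
    first, second = (f.strip() for f in fields)
    if INITIALS.match(first) and not INITIALS.match(second):
        return f"{second}, {first}"
    if INITIALS.match(second) and not INITIALS.match(first):
        return f"{first}, {second}"
    return None


entry = normalise_author(["I.", "NAME"])
```

When neither field consists of initials, the function returns `None`, which corresponds to moving on to the less frequent formats.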

In certain cases, special words are also used to help identify
the category to which a field belongs. For example, "vol" identifies
the data with the volume number category.

If a blank word is read or all nine category tables include at
least one entry, the reference has been fully processed. The program
recycles to process the next reference in the same manner. See
Appendix A for more detailed flow diagrams of the entire program.

4. Conventions

Two types of restrictions are required: input and logical.

A. Input

To distinguish unmarked numerical fields and to overcome the
limitations of the character set available on the IBM 026 keypunch
machine, the following conventions are imposed on the input data:

(a) dollar sign ($) is inserted before any boldface. This
sign indicates that all data up to the next punctuation
mark are entered in the Volume Table;

(b) asterisk (*) replaces colons and semicolons. This
will distinguish an otherwise unmarked (absence of
special words, "vol" etc.) volume number and issue
number from an unmarked page number. If there is a
comma in the body of a title, the title category
usually ends with a colon or semicolon. The replacement
of the colon or semicolon by an asterisk will
permit complete title selection in this situation;

(c) a series of two or more minus signs (--) replaces
a dash. Only the 11-punch minus sign will be
recognised;

(d) a single (') or double (") apostrophe replaces a
quote;

(e) Roman numerals are converted to Arabic numbers.
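
In the report these conventions are applied by manual pre-editing. Purely as an illustration, conventions (b), (c), and (e) could be mechanised as below; the function names `roman_to_arabic` and `pre_edit` are hypothetical.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_arabic(numeral):
    """Convert a Roman numeral such as 'XIV' to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def pre_edit(text):
    """Apply conventions (b) and (c): asterisks for colons, '--' for dashes."""
    text = text.replace(":", "*").replace(";", "*")
    return text.replace("\u2014", "--")


edited = pre_edit("Edinburgh: Oliver and Boyd")
```

The conversion assumes a well-formed numeral; it does not decide whether a string such as "III" after an author's name should be converted at all.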

B. Logical 

The known logical limitations within the program itself are
as follows:

(a) there must be an author and/or a paper (book) title
for each reference processed;

(b) editors are treated as authors;

(c) translators, when given, are ignored. The original
author(s) only are processed;

(d) occasionally a cited journal article is referred to
only by page number, rather than by the actual title
of the article. This journal title is listed in the
"Title of Paper (Book)" Table. No entry is made into
the "Title of Journal" Table. The journal title is
substituted for the missing paper (book) title;

(e) a leading long dash beginning a reference indicates
that this citation has the same author as the citation
immediately preceding. For this reason, the first
reference to be processed cannot begin with a dash.

5. Input Data 


The following list represents some typical input examples as
they appear after manual pre-editing. The titles have been shortened
where a series of dots occurs.

 1. Cook, C. E., "Modification ... Forms," Proc. 1958
    Natl. Electronics Conf., pp. 108-1067.

 2. Hildebrand, F. B. * Introduction to Numerical
    Analysis. McGraw-Hill, New York, 1956.

 3. Cecil Hastings, Jr., Approximations ... Computers.
    Princeton University Press, 1955.

 4. S. Schechter (Ed.), Nuclear ... Newsletter,
    New York University.

 5. Cesaro, Ernesto * Einleitung ... Rechning;
    translated by G. Kowalewski. Teubner, Leipzig, 1922

 6. Memo from R. B. Reddy to A. G. Carlton, The
    Computer ... Constants, JHU Applied Physics
    Laboratory, April 6, 1954.

 7. Copi, I. M., Elgot, C. C. and Wright, J. B.,
    Realisation ... Events. J. Assoc. Comput. Mach. 5
    (1958), 181-196.

 8. Courant, Friedricks, Lewy, G. B., Uber ... Physik,
    Math. Ann. 100, 32 (1928).

 9. British Association for the Advancement of Science,
    Mathematical Tables, Cambridge Univ. Press, 1952.

10. --, Generation ... Computers. Math. Tables ...
    Aids Comput. 11 (1957), 255-257.

11. Encyclopaedia Britannica. 24 vols. 1944.

12. M. A. Aczel, "The Effect ... Priorities," Opns.
    Res. 8, 730-733 (1960).

13. P. M. Morse and H. Feshback, Methods ... Physics,
    Part 1 and 2, McGraw-Hill, New York, 1953.

14. V. Nather and W. Sangren, "Abstracts - NR Codes,"
    Communications ... Machinery, 2*1 (January, 1959).

15. Aitken, A. C., Determinants and Matrices.
    Edinburgh * Oliver and Boyd, 1948.

16-17. P. A. M. Dirac, Proc. Roy. Soc., A 133, 60 (1931);
    Phys. Rev. 74, 817 (1948).

18-19. Rosenblatt, F., (a) The Perceptron ... Systems.
    Cornell Aero. Lab., Inc., Report No. VG-1196-G-1
    (1958). (b) Two Theorems ... Perceptron. Cornell
    Aero. Lab., Inc., Report No. VG-1196-G-2 (1958).

20-21. P. A. M. Dirac, Proc. Roy. Soc., A 133, 60 (1931) *
    see also F. London, Superfluids, Wiley, New York,
    1950, Vol. 1, p. 152.

22. Bickley, W. G., See Temple and Bickley

23. Samuelson, P. A., 'Iterative ... Roots,' Journ. of
    Math. and Phys. 28 (1949), 259-301.

24. --, 'A Simultaneous ... Equation,' ibid. 242
    (1950).

25. Letting Bills Pay Themselves, Business Week,
    October 27, 1956.

Typical Input
TABLE 1

6. Processed Table Listings

The input examples given in Part 5 are listed in Table 2 as
they appear as the result of the standardisation procedure.

No. | Author(s) | Title of Paper (Book) | Title of Journal | Vol. No. | Issue No. | Page No. | Publisher | City | Year
 1. | Cook, C. E. | Modification ... Forms | Proc. 1958 Natl. Electronics Conf. | | | 108-1067 | | |
 2. | Hildebrand, F. B. | Introduction to Numerical Analysis | | | | | McGraw-Hill | New York | 1956
 3. | Hastings, C., Jr. | Approximations ... Computers | | | | | Princeton University Press | | 1955
 4. | Schechter, S. (Ed.) | | Nuclear Codes Newsletter | | | | New York University | |
 5. | Cesaro, E. | Einleitung ... Rechning | | | | | Teubner | Leipzig | 1922
 6. | Reddy, R. B. | Computer ... Constants | | | | | JHU Applied Physics Laboratory | | 1954, April, 6
 7. | Copi, I. M.; Elgot, C. C.; Wright, J. B. | Realisation ... Events | J. Assoc. Comput. Mach. | 5 | | 181-196 | | | 1958
 8. | Courant; Friedricks; Lewy | Uber ... Physik | Math. Annals | 100 | 32 | | | | 1928
 9. | British Association for the Advancement of Science | Mathematical Tables | | | | | Cambridge University Press | | 1952
10. | British Association for the Advancement of Science | Generation ... Computers | Math. Tables ... Aids Comput. | 11 | | 255-257 | | | 1957
11. | | Encyclopaedia Britannica | | 24 vols. | | | | | 1944
12. | Aczel, M. A. | Effect ... Priorities, The | Opns. Res. | 8 | | 730-733 | | | 1960
13. | Morse, P. M.; Feshback, H. | Methods ... Physics | | Part 1 and 2 | | | McGraw-Hill | New York | 1953
14. | Nather, V.; Sangren, W. | Abstracts - NR Codes | Communications ... Machinery | 2 | 1 | | | | 1959, January
15. | Aitken, A. C. | Determinants and Matrices | | | | | Oliver and Boyd | Edinburgh | 1948

Category Tables After Processing
TABLE 2

A preliminary output is planned to test accurate category
separation after processing of each reference. Each category of a
reference will be listed on a separate line in the order described
in Part 2. All category listings will be indented except the author,
and spacing within a reference listing indicates that no information
was given or processed for this particular category.

Input: Cook, C. E., "Modification ... Forms," Proc. 1958
Natl. Electronics Conf., pp. 108-1067.

Preliminary Output:

(a) Cook, C. E.
    (b) Modification ... Forms
    (c) Proc. 1958 Natl. Electronics Conf.
    (d)
    (e)
    (f) 108-1067
    (g)
    (h)

Final Output: See Part 6, Table 2 and Part 8, Table 3

7. Explanation of Table Formats

A. Nine Categories

Authors

(a) Author's last name occurs first, followed by initials.
All examples.*

(b) Jr., III, etc. are included.
Example 3.

(c) Editors are treated as authors.
Example 4.

(d) Translators are ignored.
Example 5.

(e) Private communications, memos, etc.: the originator
(from) is considered the author and person(s)
addressed (to) are ignored.
Example 6.

(f) If more than one author is cited, all authors are
listed and the rest of the reference information
is listed after last author. "AND" occurs before
last author.
Example 7.

Exception: no "AND". Only last names of multiple
authors are given and the title is greater than
one word.
Example 8.

(g) If special words (Staff, Association) indicating
multiple authors occur, this field is considered
the author.
Example 9.

(h) An initial long dash refers to previous author and all
author information from previous reference is repeated.
Examples 10, 24.

(i) A reference has no given author if the special words
"Tables," "Encyclopaedia," "Dictionary," or "U.S."
appear in the first field.
Example 11.

* Examples refer to the reference numbers in Table 1 and Table 2.

Title of Paper (Book)

Book titles (Example 2) and titles of papers appearing in
journals (Example 1) are listed in this category.

Title of Journals

This category lists all titles of journals (Example 1)
and has no entries from books (Example 2).

Volume Number

(a) All information after a boldface indicator ($) up
to the next punctuation mark is considered the
volume number.

(b) All integers preceding a colon indicator (*) are
considered the volume number.

(c) An occurrence of one of the following special words
with or without prefix or suffix integers will indicate
a volume number entry: vol., v., part, pt., diary,
report, rept., Report no., paper, technical note,
tech. note, T. N., thesis, doctoral thesis.
Examples 11, 13.

Issue Number

(a) If a numerical field occurs after the volume number
or colon indicator (*) and is not the date or page
number, it is entered in the Issue Number Table.

(b) An occurrence of one of the following special words
with or without prefix or suffix integers will
indicate an issue number entry: edition, ed., series,
number, no.

Page Number

(a) An occurrence of one of the following special words with
or without prefix or suffix integers will indicate
a page number entry: p., pp., section, chapter(s),
chapt., monograph.

(b) Any number that has not been processed up to now, is
not the date and has a dash, is considered the page
number.
Example 7.

Publisher and City

Publisher and city occur only for books.

(a) Any field of nonnumeric characters (excluding
months and N.D., no date), not processed up to
now, is considered the publisher and/or city of
publication. Most references have the publisher
first, followed by the city of publication.
Example 2.

(b) Occasionally the city appears before the
publisher, therefore a small table of such
cities will be searched prior to calling this
field the publisher.
Example 15.

(c) An occurrence of one of the following special
words with or without prefix or suffix data will
indicate a publisher and city entry: laboratory,
lab.

Year

(a) If "N.D." (no date) is present there is no date entry.

(b) Any four character integer not processed up to now,
beginning with 18-- or 19--, is considered the year
of publication.
Example 2.

The month and day of month is included in the year
entry, if present.
Example 14.
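
The year rule lends itself to a short pattern test. The sketch below is an illustration only: the function `find_year` and its argument (the list of fields not yet assigned to another category) are hypothetical, and the month and day are not collected.

```python
import re

YEAR = re.compile(r"\b1[89]\d{2}\b")  # four digits beginning with 18 or 19


def find_year(unprocessed_fields):
    """Return the first four-digit integer starting with 18 or 19, or None.

    Nothing is returned when the reference carries the marker 'N.D.'.
    """
    if any("N.D." in f for f in unprocessed_fields):
        return None
    for f in unprocessed_fields:
        m = YEAR.search(f)
        if m:
            return m.group()
    return None


year = find_year(["McGraw-Hill", "New York", "1956"])
```

Because only unprocessed fields are passed in, a year that forms part of an already identified journal title, such as "Proc. 1958", would not be picked up.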

B. Multiple Citations 

Multiple citations occur when a single reference contains two 
or more cited articles or books; the cited works may be by the same 
author, or each may have a different author. Multiple citations appear 
in two general forms: the same author and multiple authors. There are 
two specific formats for each of these forms. These four formats, if 
present, must be identified at the beginning of each category to be 
processed after the author. 

Same Author - two or more citations 

(a) Repetition of boldface. 
Examples 16, 17. 

(b) Occurrence of the form "(a)_ (b)_". 

If either of these two conditions occurs the same 
author is again entered into the author table and 
the information following the repeated punctuation 
is processed. 
Examples 18, 19. 

Multiple Authors - multiple citations 

(a) "see also": The processing of a new reference begins 
if this special word is found at the beginning of a 
new category. 
Examples 20, 21. 

(b) "ibid." - in the same place 
"loc. cit." - in the same location 
"op. cit." - in the same work 

The occurrence of these special words indicates the 
same book or journal source as the previous citation 
and this information is repeated. 
Example 24. 
C. Descriptive Titles 

If the first field of the citation (up to the first punctuation mark) 
contains four or more words, after tests for initials and all special 
words likely to occur in the author field have failed, a descriptive title, 
having no given author, is assumed. This information is entered in the 
Title of Paper or Book Table. Example 25. 

8. Output Format 

A. Normalisation 

Normalisation is necessary to standardise the output from this 
part of the program before entry to a SORT program. The following 
standardisations have been adopted. See Parts 5 and 6, Tables 1 and 2. 

(a) Authors are given with last name first, followed 
by any initials. 

(b) The title of a book or paper has a leading "The" or 
"A" moved to the end; otherwise it is the same as in 
the original source. 

(c) The original form of the journal title is kept even 
if abbreviated. This may have to be modified, as some 
journals are abbreviated in more than one way and 
this may interfere with the later matching process. 

(d) The words vol. or v. will not be used in the Volume 
Number Table if they occur before the integer. 

(e) All special words for the issue number are included 
in the Issue Number Table. 

(f) The words p. or pp. will not be used in the Page 
Number Table. 

(g) The publisher and city are now processed as they 
exist. This may need modification, as abbreviations 
may occur. 

(h) The year is entered first into the Year Table, followed 
by month and day, if present. 
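
Rules (a), (b), and (h) can be illustrated with a short Python sketch. The function names and the handling of edge cases (for example, names that already contain a comma are returned unchanged, and suffixes such as "Jr." are not treated specially) are assumptions made for illustration; the report does not specify them.

```python
import re

def normalize_author(name):
    """Rule (a): last name first, followed by any initials."""
    name = name.strip()
    if "," in name:                      # assumed to be inverted already
        return name
    parts = name.split()
    surname, given = parts[-1], parts[:-1]
    initials = " ".join(p[0].upper() + "." for p in given)
    return f"{surname}, {initials}" if initials else surname

def normalize_title(title):
    """Rule (b): move a leading "The" or "A" to the end of the title."""
    for article in ("The", "A"):
        prefix = article + " "
        if title.startswith(prefix):
            return title[len(prefix):] + ", " + article
    return title

def normalize_year(date_text):
    """Rule (h): year first, followed by month and day if present."""
    match = re.search(r"(18|19)\d\d", date_text)
    if match is None:
        return None
    rest = date_text[:match.start()] + date_text[match.end():]
    parts = [p for p in re.split(r"[\s,]+", rest) if p]
    return ", ".join([match.group(0)] + parts)
```

For instance, `normalize_year("January 12, 1952")` yields the year-first form `1952, January, 12`.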

B. Tables 

(a) Unit 

For each reference processed, a maximum of one data entry is 
made to each of the nine category tables, with one exception: each 
author of a multiple-author reference is entered separately in the 
Author table. Otherwise, there is a maximum of nine data entries per 
reference. The next reference is then processed and an addition may or 
may not be made to each of the nine tables. After all references are 
processed, there are nine separate tables in memory of varying length. 
Each of these tables may be processed as a unit (e.g., alphabetical 
listing of all authors from the Author table). 

(b) Modification 

To make it possible to associate all parts of a single 
reference (given a title, get the author(s), etc.), some identification 
has to be attached to each table entry. Every data entry to any category 
table generated from a single citation carries the same identification 
number. A sequential numbering or chaining, increased by one for each 
new citation processed, must be handled during the category formation 
of each reference. The chain links the separated information from the 
same source. Note that only multiple authors have more than one data 
entry per table from the same reference (same identification number). 
After all references are processed, the tables are collected and an 
output tape is made containing the data with identification. 

Table 3 shows how the category tables, including identification, 
will appear in memory after all references have been processed. 

Table 4 shows the final output tape. The number of files on 
the tape equals the number of category tables to which entries have been 
made during processing of all citations. The first record of each file 
contains the table name. The first word of a record contains the 
identification number. In this way, each separated category can be 
worked with independently (e.g., all authors) or, using the entire output, 
parts of a single reference may be extracted. 
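
The role of the identification number can be sketched in a few lines of Python. The sketch assumes each processed reference is available as a mapping from category name to data; the function names and the sample data are hypothetical and serve only to show how entries sharing an identification number are separated and later reassembled.

```python
from collections import defaultdict

def build_tables(references):
    """Build one table per category; every entry is (identification number, data)."""
    tables = defaultdict(list)
    for ident, ref in enumerate(references, start=1):
        for category, value in ref.items():
            # Only the author category may hold several entries per reference.
            values = value if isinstance(value, list) else [value]
            for v in values:
                tables[category].append((ident, v))
    return tables

def reassemble(tables, ident):
    """Collect every entry carrying the given identification number."""
    return {cat: [v for i, v in entries if i == ident]
            for cat, entries in tables.items()
            if any(i == ident for i, _ in entries)}

# Made-up sample data, not taken from the report's tables.
refs = [
    {"author": ["Aitken, A."], "title": "Sample Paper, A", "year": "1952"},
    {"author": ["Doe, J.", "Roe, R."], "title": "Another Title", "year": "1958, January"},
]
tables = build_tables(refs)
second = reassemble(tables, 2)
```

Each table can be processed alone (for example, sorting `tables["author"]`), while `reassemble` uses the shared number to recover one complete reference.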

9. Summary 

This program, when completed, should identify and normalise all 
references into the categories described earlier, and set these up as 
tables readily available for use by a SORT program. The automatic 
analysis presupposes the absence of almost all pre-editing and manual 
determination of the information content of the input data. A SORT program 
can later be used to extract from these prepared tables those items 
which obey some given search criterion. For example, authors may be 
listed alphabetically, multiple authors and citations may be extracted, 
and many other rearrangements may be effected automatically. 
Columns: Author; Title (Paper, Book); Journal Title; Vol.; Issue; Page; 
Publisher; Year 

Category Tables After Processing 
TABLE 3 

Example numbers refer to reference numbers in Tables 1 and 2. 
Identification numbers. 
Category Tables After Chaining 
TABLE 4 

Identification numbers 
APPENDIX A 

Flow Charts: Identification and Standardisation of Nine Reference 
Categories. 

Main Program: Reads in data of next reference. 
Controls sequence of 9 main subroutines. 

Author Category: tests on the first field (two words, initial, first 
name spelled out, next field initial). 
Invert form: Alan Aitken to Aitken, A. 
Invert form: A. Aitken to Aitken, A. 
Enter Author Table. 

Paper Category. 

Journal Category: special words: Ibid., In, Paper. 
This field or next field: special words vol., issue, publ.; 
numeric (no dash). 

City: nonnumeric. Enter City Table. 

Year Category: numeric forms 1952; January, 1952; January 12, 1952. 
Change to form: 1952, January, 12. 
Enter Year Table. 

Any more information left: print, miscellaneous. 
Go to Main Program to process next reference. 

Also at the beginning of each of the nine categories two tests are 
always made: 

(1) Any data in Save Tables for this category? If yes, process for 
this category and clear the Save Table of this information. 

(2) Any more information in this reference? If no, go to Main Program 
to read in the next reference. If yes, continue with processing of 
the category. 
III. THE USE OF TREE STRUCTURES FOR PROCESSING FILES 
Edward H. Sussenguth, Jr. 

ABSTRACT 

In data processing problems, files are frequently used which 
must both be searched and altered. Binary search techniques are 
efficient for searching large files, but the associated file 
organisation is not readily adapted to file alterations. Conversely, a 
chained file allocation permits efficient alteration but cannot be 
searched efficiently. A file organised into a tree-like structure is 
discussed, and it is shown that such a file may both be searched and 
altered with times proportional to $s \log_s N$, where N is the number of 
file items and s is a parameter of the tree. It is also shown that 
optimising the value of s leads to a search time which is only 25% 
slower than the binary search. The tree organisation employs two data 
chains and may be considered to be a compromise between the organisation 
for the binary search and the chained file. The relation of the tree 
organisation to multidimensional indexing and to the trie structure is 
also discussed. An example of an automatic dictionary for language 
translation is used to illustrate the principles involved. 

1. Introduction 

In many data processing applications large files of information 
must be searched to extract some pertinent data and new data must be 
added to the file. There are many ways to perform these manipulations 
depending upon the structure of the file and the characteristics of the 
computer. When it is necessary to both search and alter the file, a 
sorting procedure is frequently employed in conjunction with the search 
technique to keep the file updated. Another attack is to avoid 
time-consuming sorting by allocating the file in the computer memory so that 
alteration is efficient; the searching of such a file is usually difficult, 
however. Several strategies for such problems are reviewed and analysed 
below. The body of the paper, however, is concerned with a method of 
allocating (and implicitly sorting) the items of a file so that the file 
may both be searched and altered efficiently. 

Before examining the details of the proposed techniques, a sample 
problem will be used to demonstrate the relative efficiency of the 
procedure when compared with other sorting and search methods. Specifically, 
it is desired to design an efficient system to produce a listing of all 
distinct items from a given list of items.* Clearly this is a problem of 
the type mentioned above in which it is necessary to search and frequently 
alter the constructed file. An example of this problem is the tabulation 
of symbolic addresses and literals in assembly and compiler programs. 
Another illustration is the frequency counting of words in a text, a common 
procedure in the information retrieval field. 

Let M be the total number of items in the given list, and let N be 
the number of distinct items in the list. If the main list is sorted and 
the duplicate items identified and removed, approximately $M \log_2 M$ 
operations (i.e., comparisons and/or transfers) are required. (The 
IBM Fortran Assembly Program (FAP) uses this system to form its symbol 
table.) Instead of sorting, the main file may be examined item by item 
and a file of distinct items constructed. If the constructed file is 
maintained in some preassigned order (to reduce the time required to 
test if a given item has occurred already), an approximate upper bound 
on the number of operations is $MN/3$. (FAP uses this system to form its 
table of literals.²) If the constructed file is not kept in order 
(thereby increasing the search time, but decreasing the time required 
to add an item to it), an approximate upper bound on the number of 
operations is M (log^ i ♦ ^). Finally, if the tree structure proposed 
below is used to construct the list of distinct items, the upper bound 
is approximately ^(M + M **S(log£ N) operations. 

*This problem is considered in more detail in Ref. 1. 

Thus, the tree procedure is significantly more efficient when 
there are relatively few distinct items. For example, if there are 100 
distinct items in a file of 1000 items, the numbers of operations for 
the four procedures are in the ratio (5:16:7:1). Hence, if the tree 
procedure is not too complex (so that one tree "operation" is comparable 
to a sorting "operation"), it merits consideration for that class of 
problems involving files which are both searched and altered. 

2. Definitions 

The underlying principle of most search techniques is to 
partition the main file into several small subfiles and to select one 
subfile for further scrutiny. As the partitioning and selection process 
may be illustrated and explained in terms of tree structures, it is 
convenient to collect all definitions of terms associated with trees. 
Most of these definitions have been adopted from Iverson. Several basic 
definitions are illustrated in Fig. 1. 

A graph comprises a set of nodes and a set of unilateral 
associations specified between pairs of nodes. If node i is associated with 
node j, the association is called a branch from initial node i to terminal 
node j. A path is a sequence of branches such that the terminal node of 
each branch coincides with the initial node of the succeeding branch. 
Node j is reachable from node i if there is a path from node i to node j. 
The number of branches in a path is the length of the path. A circuit is 
a path in which the initial node coincides with the terminal node. 

A tree is a graph which contains no circuits and has at most one 
branch entering each node. A root of a tree is a node which has no 
branches entering it, and a leaf is a node which has no branches leaving 
it. A root is said to lie on the first level of the tree, and a node 
which lies at the end of a path of length j-1 from a root is on the jth 
level. The set of nodes which lie at the end of a path of length one 
from node x are said to be governed by node x and comprise the nodes of 
the subtree rooted at node x. A chain is a tree which has at most one 
branch leaving each node. 
3. Binary and Serial Searches 

If a file of N items is stored in a random access memory with 
the items arranged so that their keys are in ascending order, a binary 
search may be used to locate an item in a time approximately proportional 
to $\log_2 N$. The binary search begins by testing first the item which is 
at the midpoint of the file. A comparison determines whether it is the 
desired item and, if it is not, the comparison specifies in which half 
of the file the desired item lies. This half is then bisected and, if 
necessary, the quarter of the file containing the desired item is 
determined. The bisection process continues until the item is located. 

The binary search procedure is conveniently depicted by a tree 
in which each node represents a file item. The node on the first tree 
level corresponds to the item at the midpoint of the file; the two items 
on the second level correspond to the items at the one-quarter and 
three-quarter points of the file; etc. Except in the last one or two levels, 
two branches emanate from each node; which of these branches is followed 
is determined by the comparison of the desired item with the item 
associated with the node in question. Selecting one branch obviously 
eliminates half of the remaining candidate items. 

Example: To clarify some of the principles introduced, an 
example of an automatic dictionary will be used. The key for such a 
dictionary is an English word and the information value of the key is its 
equivalent in a foreign language. For purposes of specific illustration the 
English-German section of "The New Cassell's German Dictionary" is used. 
Figure 2 shows the memory map of a file of 38 English words 
arranged in alphabetical order and stored in locations 301 to 338 of 
a random access memory. The binary search for the word "wallah" 
proceeds as follows: Comparison with the item "waist" at the midpoint 
of the file indicates that "wallah" is in the lower half of the file 
because "wallah" is below "waist" in the alphabetical order. The item 
in the center of the lower half is "wallop"; comparison with it 
indicates "wallah" is in the upper half of this subfile (i.e., in the 
third quarter of the main file). Another comparison, this time with 
"walk," indicates that "wallah" is in the lower half of the third 
quarter. The fourth comparison then locates the desired item. 
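
The bisection procedure can be sketched in Python as follows. The function name `binary_search` and the rounding rule for the midpoint (rounding down) are assumptions; with a different rounding rule the sequence of probes can differ from the route shown in Fig. 2. The short word list is a subset used only for illustration.

```python
def binary_search(items, key):
    """Search the ascending list items for key.

    Returns (index, probes): index is None if key is absent, and probes
    lists the items compared along the way.
    """
    lo, hi = 0, len(items) - 1
    probes = []
    while lo <= hi:
        mid = (lo + hi) // 2          # midpoint of the remaining subfile
        probes.append(items[mid])
        if items[mid] == key:
            return mid, probes
        if items[mid] < key:
            lo = mid + 1              # key lies below the midpoint in the listing
        else:
            hi = mid - 1              # key lies above the midpoint in the listing
    return None, probes

words = ["wabble", "wad", "waddle", "wade", "wafer", "waffle", "waft"]
found = binary_search(words, "waffle")
missing = binary_search(words, "wag")
```

Each probe discards about half of the remaining candidates, so the number of probes grows roughly as $\log_2 N$.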

The tree representation of this binary search is shown in 

Fig. 3. 

The binary search requires that the items be arranged in 
increasing order in consecutive locations of a random access memory. 
Although the expected search time for this arrangement ($\log_2 N$) is 
relatively small, the time to alter the file by adding (or deleting) 
items is proportional to N because many items must be moved to make 
room for the new item. The time to alter a file may be drastically 
reduced by chaining the items together instead of storing them in 
consecutive memory locations. With each item in the chained structure 
is stored the location of another item of the file. Thus, the addition 
of a new item is simply accomplished, because the chain may be broken 
at any convenient point and then relinked with the new item inserted. 
The time to search such a file is, however, proportional to N (whether 



Location  Key        Probe 
301       WABBLE 
302       WAD 
303       WADDLE 
304       WADE 
305       WAFER 
306       WAFFLE 
307       WAFT 
308       WAG 
309       WAGE 
310       WAGER 
311       WAGGERY 
312       WAGGLE 
313       WAGON 
314       WAGTAIL 
315       WAIF 
316       WAIL 
317       WAIN 
318       WAINSCOT 
319       WAIST      1 
320       WAIT 
321       WAIVE 
322       WAKE 
323       WALE 
324       WALK       3 
325       WALL 
326       WALLABY 
327       WALLAH     4 
328       WALLET 
329       WALLOP     2 
330       WALLOW 
331       WALNUT 
332       WALRUS 
333       WALTZ 
334       WAMPUM 
335       WAN 
336       WAND 
337       WANDER 
338       WANE 

A Listing of 38 Words with the Route of the Binary Search 
for the Word "WALLAH" 

Figure 2 




The Tree Which Is Implied by a Binary Search of the File of Fig. 2 

Figure 3 
or not the order is maintained), because only one other item is accessible 
from any given item. Hence the search must proceed serially, item by item. 
The tree representation of this serial search reduces to the trivial case 
of a chain. Also trivial is the new subfile partitioned at each step; it is 
merely the previous file less one item. 

Example: Figure 4 is a representation of that portion of random 
access memory containing the file of the 38 words of Fig. 2 in a chained 
allocation. The words are not arranged in any order with respect to the 
memory locations, but the alphabetic order is maintained by the chain. 
To retrieve the word "wallop," first the word "wabble" at the start of 
the chain (its location 312 is prespecified) is tested. As it is not the 
desired item, the next word of the chain, "wad," is tested; the location 
of "wad," 338, is given as part of the data of "wabble." The chaining 
link of the word "wad" indicates location 313 is to be tested next. Thus 
the items along the chain are tested until the desired item is found. 
Figure 5 shows the tree representation of this serial search. 

To add the word "waiter" to the chained allocation, the data for 
"waiter" is stored in an available location (339), the chain broken after 
"wait," and the new item inserted. Thus after inserting "waiter," location 
317 contains 

WAIT = WARTEN 339 

and location 339 contains 

WAITER = KELLNER 323 
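
The serial search and the insertion of "waiter" can be sketched in Python. The sketch models memory as a dictionary from location to a two-element list holding the key and the chaining link; this representation, the function names, and the decision to cut the chain off after "wake" are assumptions made to keep the example short. Information values are omitted.

```python
def search(memory, start, key):
    """Follow the chain from start; return (location, links followed),
    with location None if the key is not on the chain."""
    loc, steps = start, 0
    while loc is not None:
        word, link = memory[loc]
        if word == key:
            return loc, steps
        loc, steps = link, steps + 1
    return None, steps

def insert(memory, start, key, free_loc):
    """Store key at free_loc and relink the chain so alphabetical order is
    kept. Returns the (possibly new) start location of the chain."""
    prev, loc = None, start
    while loc is not None and memory[loc][0] < key:
        prev, loc = loc, memory[loc][1]
    memory[free_loc] = [key, loc]        # new item points at its successor
    if prev is None:
        return free_loc                  # new item becomes the head of the chain
    memory[prev][1] = free_loc           # break the chain and relink
    return start

# A fragment of the chain: waist -> wait -> waive -> wake.
memory = {318: ["waist", 317], 317: ["wait", 323],
          323: ["waive", 326], 326: ["wake", None]}
start = insert(memory, 318, "waiter", 339)
```

Only two links change during the insertion, whatever the length of the file, whereas the search still has to walk the chain item by item.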
Location  Word       Link 
301       WADE       316 
302       WALRUS     319 
303       WAG        332 
304       WAN        329 
305       WAINSCOT   318 
306       WAGTAIL    314 
307       WAMPUM     304 
308       WALLOP     333 
309       WALLET     308 
310       WAGGLE     324 
311       WANE       * 
312       WABBLE     338 
313       WADDLE     301 
314       WAIF       331 
315       WALNUT     302 
316       WAFER      320 
317       WAIT       323 
318       WAIST      317 
319       WALTZ      307 
320       WAFFLE     330 
321       WALL       337 
322       WAGGERY    310 
323       WAIVE      326 
324       WAGON      306 
325       WANDER     311 
326       WAKE       336 
327       WALK       321 
328       WALLAH     309 
329       WAND       325 
330       WAFT       303 
331       WAIL       334 
332       WAGE       335 
333       WALLOW     315 
334       WAIN       305 
335       WAGER      322 
336       WALE       327 
337       WALLABY    328 
338       WAD        313 

A Memory Map of the Chained Allocation 
of the Words of Fig. 2 

Figure 4 
The Tree Implied by a Serial Search 
Figure 5 

To add "waiter" to the ordered listing (Fig. 2), however, requires 
moving all words from "waive" to "wane" down one location in memory before 
inserting "waiter" in its proper location at 321. 

Summarising, it is seen that the ordered arrangement with a binary 
search is efficient for a file which is frequently searched and infrequently 
altered, and the chained arrangement is efficient for a file which is 
frequently altered but infrequently searched. If it is necessary both to 
search and to alter the file, neither arrangement is attractive and another 
may be preferable. 

The tree allocation, described in the following sections, is a 
compromise arrangement which utilizes the effective partitioning of the file 
found in the binary search and which chains items together for 
simplicity of organisation and alteration. It will be shown that 
the tree allocation may be both searched and altered with time 
proportional to $s \log_s N$, where s is a parameter associated with the 
tree structure. It is useful both when the file is completely stored 
in random access memory and when the bulk of it is in a disc or drum 
memory. 


4. The Tree Allocation 

In a tree allocation the partitioning of the file is accomplished 
by breaking the key into several disjoint parts. Each part or element of 
the key is made to correspond to a level of the tree. The first tree 
level lists all possible values of the first element of the key. With 
each of these elements is associated a list of those second elements 
which may be used in combination with that first element. A complete 
list of all possible second elements is not necessary, for some of them 
will never be used. 

Example: A natural way to break an English word into several 
parts is to partition it into its component letters. Then the first 
tree level will have 26 nodes, one for each letter of the alphabet. 
With each letter is associated a list of letters which may be used 
with it as the start of an English word. Thus with the letter "w" on 
the first level is associated a list of the following letters on the 
second level: a, e, h, i, o, r, and y. The letter "k," for example, 
is not included in this list as no English word starts with the pair 
"wk." 
The two-level tree may partition the file into enough subfiles 
so that any one of them may be conveniently searched serially. Then 
with each second-level tree node, i.e., with each pair of elements, is 
associated the block of items governed by the partition implied by the 
pair. However, if the number of elements in each subfile is still so 
large that an efficient search is not possible, the tree may be extended 
to include more levels. In this case, a list of those key elements which 
may be used with the pairs corresponding to the second-level nodes is 
associated with each second-level tree node. Clearly the tree may be 
extended in this manner through as many levels as desired, and one part of 
the tree may contain more levels than some other part. By varying the 
number of levels in different parts of the tree it is possible to make the 
number of items in each subfile nearly uniform, if this is desired. Indeed 
it is possible to extend the tree levels so that each subfile consists of 
only one item. The remainder of this section describes a procedure by 
which the number of levels and the subfile sizes may be chosen so as to 
minimise the expected search time. 

Example: There are about 14,000 English words listed in the 
English-German section of Cassell's dictionary. The partitioning 
imposed by a one-level tree is shown in Fig. 6, where it is seen that 
"w" governs 355 items. Partitioning on the first two letters breaks 
the "w" portion of the file into subfiles of the following sizes 
(see Fig. 7): 

wa 75    wh 70    wo 45 

we 55    wi 78    wr 31 

wy 2. 
Letter  Words     Letter  Words 
A        755      N        270 
B        775      O        355 
C       1335      P       1170 
D        815      Q        100 
E        515      R        840 
F        680      S       1695 
G        395      T        785 
H        480      U        615 
I        535      V        200 
J        100      W        355 
K         80      X         15 
L        540      Y         50 
M        675      Z         30 

Distribution of Initial Letters of a Sample of 
14,000 English Words 

Figure 6 

The subfile associated with "wy" is certainly small enough to be 
conveniently searched on an item-by-item basis. However, the other 
subfiles associated with "w" would require an item-by-item search which 
is considerably longer and perhaps should be partitioned on the third 
letter. Figure 8 shows part of a four-level tree for this file. 
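
The idea of extending the tree only where a subfile is still too large can be sketched in Python. The function `partition`, its `max_size` threshold, and the small word list are illustrative assumptions; the report itself chooses subfile sizes by the analysis of Section 5 rather than by a fixed threshold.

```python
from collections import defaultdict

def partition(words, max_size, depth=1):
    """Split words into subfiles keyed by their leading letters, adding one
    more letter to the key wherever a subfile exceeds max_size."""
    groups = defaultdict(list)
    for w in words:
        groups[w[:depth]].append(w)
    result = {}
    for prefix, group in groups.items():
        # Stop when the subfile is small enough, or when no word in it has
        # any further letters to partition on.
        if len(group) <= max_size or all(len(w) <= depth for w in group):
            result[prefix] = group
        else:
            result.update(partition(group, max_size, depth + 1))
    return result

words = ["wad", "wade", "waft", "wag", "web", "wed", "why", "wry"]
subfiles = partition(words, max_size=2)
```

Here the "wa" portion is split on the third letter while "we," "wh," and "wr" stay at two letters, so different parts of the tree have different numbers of levels.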

The search for a given item in a tree allocation is conducted 
by scanning the set of nodes on the first level until the element which 
matches the first element of the key is located. Then one proceeds to 
the set of elements associated with that node on the next level; that 
is, to the filial set of that node. This set is scanned until the second 
key element is matched. Its filial set is located and scanned similarly. 


Distribution of Second and Third Letters When Initial Letter is W 

Figure 7 
This process continues, building a path through the tree to the final 
block of key elements. This block is just the filial set of the last 
node of the path; it differs from the other filial sets in two ways. 
First, it has associated with each of its nodes, not one element of the 
key, but all of the elements of the key not already associated with the 
nodes of the path leading to it. Second, each of the nodes of the final 
filial set does not have a filial set of its own, but rather indicates the 
information value of the key which the node represents. 

There are several ways in which a node may be joined to its 
filial set and the nodes within a filial set may be kept together. A 
very convenient technique is to chain each node to its filial set and to 
chain the nodes within each filial set together. This arrangement is 
called a doubly-chained tree.³ Using this method, each tree node is 
represented by one computer word. The computer word is divided into three 
fields: the first indicates the key element value of the node, the second 
contains the address of another node in the filial set of which the given 
node is a member, and the third contains the address (of the first node) 
of its filial set.* (See Fig. 9.) 

³Iverson describes a similar arrangement called a filial-heir chain 
representation. Johnson also discusses this arrangement. The tree 
is also closely related to list structures. 

*A singly-chained tree is also possible. The nodes of a filial set 
are stored in consecutive memory locations, instead of being chained 
together. Then each computer word contains a value field and only one 
address field. 
Example: Figure 10 shows the actual contents of memory for the 
doubly-chained allocation of the tree in Fig. 8. 

The filial set is scanned by following the chain of addresses in 
the second field of the computer word and comparing the given key element 
with the key element values of the nodes of the filial set. When a match 
is found, the next tree level is reached by branching to the address given 
in the third field. Thus if there are s nodes in the filial set, the 
expected number of chaining links required to find a match is 
$\tfrac{1}{2}(s + 1)$: one link to reach the filial set and 
$\tfrac{1}{2}(s - 1)$ to search it.† 
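
The count of chaining links can be checked by direct enumeration. The short Python sketch below assumes, as the text does, that every node of the filial set is equally likely to be the match; the function name `expected_links` is hypothetical.

```python
from fractions import Fraction

def expected_links(s):
    """Average number of chaining links followed to match a key element in a
    filial set of s equally likely nodes: one link to reach the set, then
    k - 1 links along the set's chain to reach its k-th node."""
    total = sum(1 + (k - 1) for k in range(1, s + 1))
    return Fraction(total, s)
```

The sum of 1 through s divided by s is (s + 1)/2, which agrees with the expression in the text for every set size.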

It is easy to add an item to the doubly-chained tree allocation. 
The tree is entered assuming the item is in the file, and the path for the 
key found. At some point a key element will not be found in an existing 
filial set. This new key element is then added to that filial set by 
breaking and relinking the filial set chain. The double chaining feature 
permits the use of any available memory locations for the new tree nodes. 
Moreover, this feature allows an item to be added to the file in roughly 
the same time that is required to locate an item.‡ 
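
A doubly-chained tree can be sketched in Python with one object per node carrying the three fields described above. This is a simplified illustration under stated assumptions: every key element is a single letter, each key is carried letter by letter to the end of its path (there is no final block holding the remaining elements), new nodes are linked at the head of the filial-set chain, and the class and attribute names are hypothetical.

```python
class Node:
    """One tree node: key element value, link to the next node of the same
    filial set, and link to the first node of the node's own filial set."""
    def __init__(self, value):
        self.value = value
        self.sibling = None   # second field: next node in this filial set
        self.child = None     # third field: first node of its filial set
        self.info = None      # information value, set where a key ends

class DoublyChainedTree:
    def __init__(self):
        self.root = Node(None)   # dummy node; its filial set is level one

    def _find(self, parent, element):
        """Scan the filial set of parent for element by following the chain."""
        node = parent.child
        while node is not None and node.value != element:
            node = node.sibling
        return node

    def search(self, key):
        node = self.root
        for element in key:
            node = self._find(node, element)
            if node is None:
                return None
        return node.info

    def insert(self, key, info):
        node = self.root
        for element in key:
            nxt = self._find(node, element)
            if nxt is None:                 # element missing from the filial set
                nxt = Node(element)
                nxt.sibling = node.child    # break and relink the chain
                node.child = nxt
            node = nxt
        node.info = info

tree = DoublyChainedTree()
tree.insert("wait", "warten")
tree.insert("waiter", "Kellner")
```

Insertion follows the same path as a search and adds nodes only where the path runs out, which is why the two operations take roughly the same time.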

In many applications the file is so large that it will not fit 
into fast random access memory but is stored on slower media such as discs 
or drums. A file stored on a disc or drum is physically subdivided by 
the tracks of the disc or drum into subfiles, one subfile per track. 
In this case the tree structure may be thought of as a transformation 
from the key to the track address of its associated data. The subfile 
corresponding to this track may then be scanned for the desired item on 
an item-by-item basis or transferred to the random access memory for 
scanning by the binary search. 

†It is assumed all nodes in the filial set have the same probability of 
being selected. If the probabilities differ, the nodes should be tested 
in order of decreasing probability for greatest efficiency. An easy way 
to do this is to arrange the nodes in the order of the number of items 
they govern. 

‡In a singly-chained tree the new tree nodes must be added at a location 
adjoining the locations reserved for their filial set. To do this may 
entail relocating the entire filial set. 

Map of the Tree of Fig. 8 
Figure 10 

5. Minimisation of the Expected Search Time 

Assuming the search time is proportional to the number of 
chaining links traversed, the expected search time may be calculated if 
all of the filial set sizes are known. However, it is inconvenient to 
use the set of all filial set sizes in macroscopic calculations. For 
computational purposes an average filial set size for each tree level is 
defined as: 

$$s_i = \frac{\text{number of nodes on level } i}{\text{number of filial sets on level } i}. \qquad (1)$$

Since the average time to search a filial set on the ith level is 
$\tfrac{1}{2}(s_i + 1)$, the expected time to search an h-level tree is 

$$\frac{1}{2}\sum_{i=1}^{h}(s_i + 1). \qquad (2)$$
Thus the expected search time, for a file of N items allocated as 
an h-level tree with blocks of average size B, is 

$$t = \frac{1}{2}\sum_{i=1}^{h}(s_i + 1) + t(B), \qquad (3)$$

where $t(B)$ is the expected time to search one block. 

Equation (3) gives the expected search time in terms of the 
parameters $s_i$, h, and B. These parameters are related by the expression 

$$B \prod_{i=1}^{h} s_i = N, \qquad (4)$$

because $\prod_{i=1}^{h} s_i$ represents the number of nodes on the hth 
level,* which is just equal to the number of blocks. Frequently when 
designing an allocation and search system for a particular file the user 
is free to vary one or more of $s_i$, h, or B under the constraint of 
Equation (4) to achieve an efficient system. Several such situations are 
discussed below. 

*This assumes there are no leaves on levels 1, 2, ..., h-1. 

CASE A: In many data processing problems the file fits within the 
random access memory, and the data processing requirements are such 
that the key elements may be manipulated at will as long as the proper 
response is received from a given query. The keys may then be considered 
to consist of a single string of binary digits rather than several disjoint 
elements each of which consists of several binary digits. For example, the 
key "CAT" is considered as the binary string "010011010001110011" rather 
than the set of distinct elements "C," "A," and "T," or equivalently, the 
distinct elements "010011," "010001," and "110011." With keys of this 
format, the binary digits may be grouped to give the most efficient search 
system; that is, the $s_i$, h, and B may be selected without constraint 
(other than (4)) to minimise t. It is shown in the appendix that the 
minimum t is achieved when 

(1) all paths from a root to a leaf have the same length, h; 

(2) all filial sets have the same number of members, s, that 
is, $s_1 = s_2 = \cdots = s_h = s$; 

(3) the number of elements in a block, B, is the same as the 
common filial set size, that is, B = s; and 

(4) the common filial set size s is $N^{1/(h+1)}$. 

The results (1) through (4) fix B as equal to s but relate s, 
h, and N only by the relation $s^{h+1} = N$, so that either s or h may be 
selected arbitrarily. Another analysis, also in the appendix, shows 
that the s which satisfies this constraint and also minimises t is 
3.6 nodes per filial set. 

CASE B: Another class of data processing problems has characteristics
similar to those of Case A, except that the file is too large to fit
within the random access memory and is stored instead on a drum or disc
memory. Here the block size B is normally determined by the number of
items, T, which can be accommodated by one track, and not by considerations
which minimize the over-all search time; for these cases the tree acts as
a transformation from the set of keys to the set of track addresses,
and $\bar{t}$ may be minimized by varying the $s_i$ and h. It is shown in the
Appendix that when B is fixed at T the minimum is achieved when

(1) all paths from a root to a leaf have length h;

(2) all filial sets have the same number of members, s; and

(3) the common filial set size s is $(N/T)^{1/h}$.

As in Case A, absolute values for s and h are not fixed by these results,
and it can be shown that the optimum s is also 3.6 nodes per set.

CASE C: For some applications it may be inconvenient or unnatural to
consider the keys as single strings of binary digits; rather the keys
must be considered to consist of several distinct elements. If the
number of elements per key is a constant, h, for all items of the file,
or if it is desirable to form a tree using h levels, the analyses of
Cases A and B apply, assuming h to be fixed rather than variable. Thus,
the minimum search time is achieved when the filial set sizes are all
equal to the optimum value of either $N^{1/(h+1)}$ or $(N/T)^{1/h}$ according as
the file fits within random access memory or the file is stored on a
disc memory.

The optimum tree allocation for these three cases requires that
all paths within a tree be of the same length. Frequently, however, it
is profitable to vary the path lengths within the tree, stopping the
branching in each path at some optimum path length. For the case when
the main storage is a disc, the proper path lengths are easily determined
because each leaf must govern T items. Thus, if a particular node governs T
or fewer items, the branching along that particular path may be stopped.
If, however, the node governs more than T items, the branching must be
continued for at least one more level.

If the file fits within the random access memory, the optimum
path length determination is based on the number of items, g, which a
particular node governs and the number of nodes, s, in its filial set.
A detailed analysis of this optimization is presented in the Appendix;
the main result is that in most cases the optimum length of a path is
achieved when the branching is discontinued at the first node which
governs fewer than six items.

In Cases A and B, in which it was possible to manipulate the keys
to determine a minimum expected search time, it was found that the
minimum was achieved when all filial sets had the same number of members,
s, and the optimum s was 3.6. Substituting into Equation (3) it can be
shown that (see Appendix)

$$\bar{t} = \frac{s + 1}{2} \log_s M \tag{5}$$

where M is either N or N/T according as the file is stored in the random
access memory or on a disc memory. Letting s take on its optimum value
of 3.6, one finds

$$\bar{t} = 2.3 \log_{3.6} M = 1.24 \log_2 M. \tag{6}$$

That is, the expected search time in the optimum case is only 24% slower
than the binary search time. Moreover, as was indicated in the previous
section, the time to add or delete an item to the file is approximately the
same as the search time, a considerable improvement over the file
allocated for binary searching.
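The constant in Equation (6) can be checked with a few lines of arithmetic. The sketch below rewrites the search time as a multiple of $\log_2 M$, using $\log_s M = \log_2 M / \log_2 s$; the function name is illustrative.

```python
from math import log2


def search_coefficient(s):
    """Coefficient c in t = c * log2(M), from t = ((s + 1) / 2) * log_s(M)."""
    return (s + 1) / (2 * log2(s))


coefficient = search_coefficient(3.6)   # about 1.24
```

For comparison, `search_coefficient(2)` evaluates to 1.5, so a tree with two-node filial sets is slower than one using the optimum size under this model.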

It is not always necessary to manipulate the keys so that
the constructed elements are completely different from the natural
elements, although in general any natural partitioning of the keys
(e.g., partitioning English words into their component letters) will
not lead to filial sets of the optimum size. If the sizes, $s_i$, are
smaller than the optimum, two or more tree levels may be combined to
a single level with more nodes per filial set by considering two or
more key elements to be a single element. Conversely, if an $s_i$ is much
larger than the optimum, one tree level may be split into several
levels by factoring the corresponding key element. The simplest way
to accomplish the factorization is to take the binary representation
of the element in several pieces of a few bits each, e.g., consider a
six-bit element as two pieces of three bits each or three pieces of
two bits. Factoring in this way should make the filial sets
relatively constant in size, not only from level to level, but also
within the same level.

Example: Figure 11 shows a tree in which two levels have been
combined into a single level by considering the second and third
letters to be a single key element.

Figure 12 shows how one level may be split into two levels by
factoring the binary representation of the key elements.
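The two manipulations just described can be sketched on the six-bit codes quoted earlier for "C," "A," and "T." The helper names are hypothetical, and key elements are held as strings of "0" and "1" purely for readability.

```python
def factor_element(bits, piece_width):
    """Split one key element (a bit string) into pieces of piece_width bits."""
    if len(bits) % piece_width != 0:
        raise ValueError("element length must be a multiple of piece_width")
    return [bits[i:i + piece_width] for i in range(0, len(bits), piece_width)]


def combine_elements(elements, group_size):
    """Merge each run of group_size adjacent key elements into one element."""
    return ["".join(elements[i:i + group_size])
            for i in range(0, len(elements), group_size)]


cat = ["010011", "010001", "110011"]     # "C", "A", "T" as given in the text
split_c = factor_element(cat[0], 3)      # one tree level becomes two
merged = combine_elements(cat, 2)        # the first two levels become one
```

Factoring a six-bit element into three-bit pieces caps the filial set size at 8, while combining two six-bit elements allows up to 4096 distinct values at one level.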

The efficient search time of the tree structure noted above is
achieved by using a relatively small filial set size. Small filial sets
in turn imply relatively larger amounts of random access storage locations
to accommodate the tree. Specifically, the total number of nodes in a
tree of h levels, assuming there are leaves only on the hth level, is

$$\sum_{j=1}^{h} \prod_{i=1}^{j} s_i. \tag{7}$$

If it is assumed that the $s_i$ are all equal, $s^h = N$, and (7) becomes

$$\sum_{j=1}^{h} s^j = \frac{s\,(s^h - 1)}{s - 1} = \frac{s}{s - 1}(N - 1). \tag{8}$$

Equation (8) shows that as s increases, the number of storage locations
required decreases.

Thus an efficient search time is achieved with relatively small s,
but fewer storage locations are needed with relatively large s. To achieve
a balance between these conflicting criteria, the measure of efficiency may
be taken to be the product of search time and storage capacity. This
measure is directly related to the cost of the system for it reflects both
the amount of equipment and the length of time the equipment is in use.
Using (5) and (8) one finds that the cost C is

$$C = \frac{s + 1}{2} \log_s N \cdot \frac{s}{s - 1}(N - 1). \tag{9}$$

The cost C achieves its minimum value when s = 5.3.

The curves plotted in Figs. 13, 14, and 15 show, respectively,
the expected search time as a function of s (Case A or B), the cost
as a function of s (Case A or B), and the search time as a function of s
and g (Case C). All curves are normalized so that the minimum values
are unity. In each case there is a shallow minimum, indicating that
it is not too expensive in terms of time or cost if the average filial
set size varies from the optimum value. Thus, for example, a speed
decrease of less than 20% from optimum is observed if the actual filial
set size is between 2 and 8 nodes, and, similarly, the cost increases
less than 20% from optimum if the actual size is between 3 and 12.

In summary, therefore, if it is possible either to select or
to manipulate the sizes of the filial sets, they should be chosen to
be in the range of 4 to 8 nodes per set for most efficient operation.
In this case each path from root to item will have the same length.
If it is not possible to choose the filial set size, the most efficient
operation will be achieved when the path lengths vary, and the optimum
path length is determined by terminating the path at any node which
governs six or fewer items.

6. Multidimensional Indexing, Tries, and Trees

Multidimensional indexing techniques provide a means for
partitioning a file into subfiles which is simpler than the tree
structure and also permits more rapid entry to a subfile than the
tree. An h-dimensional indexing arrangement is essentially an
h-dimensional array of addresses. Element $(i_1, i_2, \ldots, i_h)$ of the array
indicates the address of the subfile composed of those items whose
first h key elements are key element $i_1$, key element $i_2$, ..., key
element $i_h$. Such an array requires $n_1 \times n_2 \times \cdots \times n_h$ computer words
of storage (where $n_j$ is the number of elements in the jth key position).
In most data processing applications, many of the entries are not used
and, hence, are wasted. Moreover the size of the subfiles is not uniform,
thereby increasing the expected search time.

An h-dimensional array may be loosely thought of as an h-level
tree with the jth dimension corresponding to the jth tree level. Thus,
the h-dimensional indexing technique is roughly equivalent to an h-level
tree. In this tree each node possesses a complete filial set; that is,
all filial sets on the jth level have $n_j$ elements. This is equivalent
to assuming that each key element may be followed by every key element of
the next level, no elision taking place because of nonoccurrence of element
pairs. See Fig. 17.

Example: A three-dimensional indexing technique for the English
dictionary requires 26 x 27 x 27 = 18,954 locations. The size of the
subfiles varies greatly: empty for many letter triples; one item for
"aja," 113 for "int." The equivalent tree has 26 nodes on the first
level and 27 nodes in every filial set of the tree. Thus nodes such as
"q" (which needs only one node in its sib set) and "wk" and "waq"
(which need no filial sets) all have filial sets of 27 elements.
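The address computation behind such an index can be sketched as follows. The row-major layout and the letter coding (first position "a" through "z" as 0 to 25; later positions blank as 0 and "a" through "z" as 1 to 26) are assumptions made for illustration; the text specifies only the dimensions 26 x 27 x 27.

```python
DIMS = (26, 27, 27)   # first letter, then letter-or-blank twice


def index_size(dims):
    """Number of address words needed: n_1 * n_2 * ... * n_h."""
    size = 1
    for n in dims:
        size *= n
    return size


def encode(prefix):
    """Map a three-character prefix to positions (i_1, i_2, i_3)."""
    first = ord(prefix[0]) - ord("a")
    rest = [0 if ch == " " else ord(ch) - ord("a") + 1 for ch in prefix[1:3]]
    return (first, *rest)


def index_offset(positions, dims=DIMS):
    """Row-major offset of element (i_1, ..., i_h) in the address array."""
    offset = 0
    for i, n in zip(positions, dims):
        if not 0 <= i < n:
            raise IndexError("key element out of range")
        offset = offset * n + i
    return offset


total_words = index_size(DIMS)
int_offset = index_offset(encode("int"))
```

A single multiply-and-add per key position reaches the subfile address, which is why entry is faster than scanning a filial set; the price is that every one of the 18,954 words is reserved whether or not its letter triple occurs.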

An obvious modification to this tree structure is to eliminate
the filial sets which are never used. That is, if one or more nodes of
a filial set are used as parts of items, that complete filial set is
retained in the tree; if none of the nodes are used, that filial set is
removed. See Fig. 17. Notice that there still are nodes remaining in
the tree which are never used; if these nodes are also removed, the
structure reduces to the tree discussed in the preceding parts of this
paper. See Fig. 17. In some cases, however, the retention of these
unused nodes is useful, because all filial sets in the tree are complete
and may be conveniently entered by simple indexing, eliminating the
node-by-node scanning of the set required for incomplete filial sets.

This modification which removes unused filial sets is essentially
the tree organization described by Fredkin and dubbed a trie (retrieval).
The University of California at Berkeley uses a trie (although they do
not call it a trie, or even a tree) for their language translation
dictionary of 20,000 words (ref. 8).

The representation of a trie in a computer is similar to that of
a tree. As before, one computer word is made to correspond to one tree
node, and the nodes of one filial set are represented by contiguous
computer words. The computer word need contain only one field, the
chaining address to link the node to (the first word of) its filial set.
The key element value of the node need not be stored as its value is
implicit in the position of the node within the filial set.

Example: Figure 18 shows the four-level trie which corresponds
to the tree of Fig. 8. Note the filial set of the node "w" contains the
complete alphabet, but only the nodes "wa," "we," "wh," etc. have filial
sets themselves. These filial sets also contain the complete alphabet.
Figure 19 shows the memory map for the trie of Fig. 18.
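The representation described above can be sketched with a flat list standing in for memory. Each filial set occupies 26 contiguous words, the word for a letter is found by indexing rather than scanning, and a word holds only the chaining address of the next filial set. The second list that carries a payload per node is an addition made here so the sketch can return something; it is not part of the layout in the text.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
WIDTH = len(ALPHABET)


class ArrayTrie:
    def __init__(self):
        self.link = [None] * WIDTH   # chaining address of each node's filial set
        self.item = [None] * WIDTH   # payload stored at a node (sketch only)

    def _grow(self):
        """Allocate one complete filial set and return its first address."""
        base = len(self.link)
        self.link.extend([None] * WIDTH)
        self.item.extend([None] * WIDTH)
        return base

    def insert(self, key, payload):
        base = 0
        for depth, ch in enumerate(key):
            slot = base + ALPHABET.index(ch)      # position implies the letter
            if depth == len(key) - 1:
                self.item[slot] = payload
                return
            if self.link[slot] is None:
                self.link[slot] = self._grow()
            base = self.link[slot]

    def lookup(self, key):
        base = 0
        for depth, ch in enumerate(key):
            slot = base + ALPHABET.index(ch)      # one indexing step per level
            if depth == len(key) - 1:
                return self.item[slot]
            if self.link[slot] is None:
                return None
            base = self.link[slot]
        return None


trie = ArrayTrie()
for number, word in enumerate(["war", "was", "web"], start=1):
    trie.insert(word, number)
```

Three short words already reserve four complete filial sets (104 words), which illustrates the storage penalty discussed next.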

[Figure 18. A Four-Level Trie]

[Figure 19. Storage Map of the Trie of Fig. 18]

Retrieval from a trie structure requires essentially only one
indexing manipulation for each level of the trie; thus the expected
search time is proportional only to the average path length, h, rather
than $\tfrac{1}{2}h(s + 1)$ as in the tree structure. This speed advantage is
compensated for by requiring more storage locations, however. To
estimate the number of locations, let n be the (average) number of
elements possible in each key element position and s be the (average)
number of elements utilized in each position. The trie requires n nodes
on the first level and n nodes in each filial set of every other level.
On the average, s of the nodes in each filial set will have filial
sets themselves. Thus, the total number of nodes in the trie is:

$$n + sn + s^2 n + \cdots + s^{h-1} n = n\,\frac{s^h - 1}{s - 1}. \tag{10}$$


A comparison of Equation (8) with (10) shows that the storage requirements
of the tree and the trie are in the ratio s/n.*

$(s/n)^h$ may be considered to be the density of key words which
actually occur in usage in the set of all possible key words. For most
practical data processing applications (e.g., automatic dictionary,
personnel file, inventory records, etc.) it does not seem that the
density is very high.

* Since the trie requires a shorter word length than the tree (one
address field vs. two (possibly one) address fields and one value
field), the ratio 2s/n might be more appropriate.

Example: For the dictionary n = 26 for the first level and
n = 27 for all other levels; s = 26 for the first level, also.
Vowels on the first level have filial sets with s near 26, consonants
have s about 8 or 10; thus an average s for the second level is about
15. For the third level, it appears that s is about 10, and for
subsequent levels s appears to be less than 5. (Detailed analysis of
these data has not been made.) Thus s/n for the dictionary considered
is about 1/2 considering the first four or five levels.
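The storage comparison between Equations (8) and (10) can be checked with small integers. The sketch below assumes a single average s and n for every level, as the equations do; the values s = 10, n = 27, and h = 4 are chosen only for illustration and are not the dictionary figures.

```python
def tree_nodes(s, h):
    """Equation (8): s + s**2 + ... + s**h nodes in the tree."""
    return sum(s ** j for j in range(1, h + 1))


def trie_nodes(n, s, h):
    """Equation (10): n + s*n + ... + s**(h-1) * n nodes in the trie."""
    return sum(n * s ** j for j in range(h))


def density(n, s, h):
    """Fraction (s/n)**h of possible keys that actually occur."""
    return (s / n) ** h


tree = tree_nodes(10, 4)         # 10 + 100 + 1000 + 10000
trie = trie_nodes(27, 10, 4)     # 27 * (1 + 10 + 100 + 1000)
```

The two counts stand exactly in the ratio s/n = 10/27, while the density for these values is under 2 percent, which shows how quickly the density falls as levels are added.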

If s/n is near unity, the trie organization is preferable to
the tree as its search is much faster and its representation requires
about the same storage space. In this case, however, the total number
of storage locations required is greater than $n^h$; h-dimensional indexing
requires $n^h$ locations and its expected search time is less than that of
the trie. Thus if s ≈ n, it appears that a multidimensional-indexing
organization is preferred to either a tree or a trie.

If s is significantly less than n, the speed advantage of the 
trie is more than compensated for by the excess storage requirement, 
unless the random access memory can accommodate this excess. If s is 
quite small, it was suggested that two or more levels be combined to 
yield an s closer to the optimum value. Such a technique would be 
disastrous with the trie organization because the number of possible 
key elements in such a combination grows very rapidly (e.g., $n^2$ for two
levels), reducing the over-all density enormously.

Thus the trie may be considered to be an organization intermediate 
to the tree and to multidimensional indexing, and it may be considered
to be a special case of each. If the density is almost unity, the multi¬ 
dimensional indexing techniques are superior; if the density is far from 
unity, the tree organization is superior; somewhere in between the trie 
may be used advantageously.

7. Summary

The tree organization of the keys of a large file has been
reviewed. Such a structure was found to be useful either when the
entire file is stored within a random access memory or when the bulk
of the file is held on a disc or drum memory.

The most important characteristic of the tree organization is
that it can be both searched and altered efficiently. The expected
search and add-item times were shown to be proportional to $s \log_s n$,
where s is the average filial set size and n is either the number of
items in the file for random access memories or the number of tracks
for a disc memory. The optimum s was found to be small (3.6 for minimum
search time, 5.3 for minimum cost), and, using the optimum s, the
expected search time was calculated to be only 24% slower than the
binary search.

APPENDIX

DERIVATION OF THE RESULTS FOR THE OPTIMUM EXPECTED
SEARCH TIME


Equation (1) gives the definition of the average filial set size
for tree level i. Since the number of filial sets on level i is just
the number of nonleaf nodes on level i - 1, an equivalent definition,
which is sometimes more convenient, is

$$s_i = \frac{\text{number of nodes on level } i}{\text{number of nonleaf nodes on level } i - 1} \tag{A1}$$

(and $s_1$ = number of nodes on level 1).

For Case A it is assumed that the blocks are searched on an
item-by-item basis so that $t(B) = \tfrac{1}{2}(B + 1)$. Thus Equation (3) becomes

$$\bar{t} = \frac{1}{2} \sum_{i=1}^{h} (s_i + 1) + \frac{N}{2 \prod_{i=1}^{h} s_i} + \frac{1}{2}. \tag{A2}$$


From the symmetry of Equation (A2) in $s_1, s_2, \ldots, s_h$ it is clear
that the expected search time is independent of the order in which the
levels are taken and that the minimum search time is achieved when the
$s_i$ are all equal. The latter assertion is proved by setting the
derivative of (A2) with respect to $s_k$ equal to zero. This yields

$$s_k \prod_{i=1}^{h} s_i = N. \tag{A3}$$

By letting k = 1, 2, ..., h one finds

$$s_1^2 s_2 \cdots s_h = s_1 s_2^2 \cdots s_h = \cdots = s_1 s_2 \cdots s_h^2 = N$$

so that

$$s_1 = s_2 = \cdots = s_h = N^{1/(h+1)}. \tag{A4}$$

This means that the time spent searching the block at the end of the
tree should be the same as the time spent searching a filial set.
That is, the block size should be $N^{1/(h+1)}$, so that the entire structure
may be viewed as a tree of h + 1 levels. Setting

$$L = h + 1, \tag{A5}$$

(A2) and (A4) become

$$\bar{t} = \frac{L}{2}(s + 1) \tag{A6}$$

and

$$s = N^{1/L}. \tag{A7}$$


If the file is stored on a drum or disc as in Case B and not in
random access memory, the block size, B, is normally determined by the
number of items, T, which can be accommodated by one track and not by
considerations which minimize the over-all search time. The tree acts
as the transformation between the set of keys and the set of track
addresses.

In this case B = T, and the tree has h levels, where h is the
largest k which satisfies

(A8)

Since B is fixed by an external constraint, it is safe to assume
t(B) is also fixed, so that the problem of minimizing the over-all
expected search time is reduced to choosing the $s_i$ so that

$$\frac{1}{2} \sum_{i=1}^{h} (s_i + 1) \tag{A9}$$

is minimized, subject to constraint (A8).

Again it is clear that the minimum is achieved when

$$s_1 = s_2 = \cdots = s_h = s.$$

Then

$$\bar{t} = \frac{h}{2}(s + 1) \tag{A10}$$

and

$$s = \left(\frac{N}{T}\right)^{1/h}. \tag{A11}$$

Equations (A10) and (A11) have exactly the same form as (A6) and
(A7). From either pair one obtains

$$\bar{t} = \frac{s + 1}{2} \log_s M. \tag{A12}$$

It is possible to determine a numerical value for the optimum
filial set size from either pair of equations. Minimizing

$$\bar{t} = \frac{h}{2}(s + 1) \tag{A13}$$

subject to the constraint

$$s = M^{1/h} \tag{A14}$$

leads to the requirement that

$$s + 1 = s \ln s$$

which has a solution at s = 3.6. Figure 13 displays (A13) subject to
(A14) normalized to $\bar{t}_{\text{opt}} = 1$.
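The root of the optimality condition $s + 1 = s \ln s$ can be located by bisection. The condition follows from substituting $h = \ln M / \ln s$ into the search time and differentiating $(s + 1)/\ln s$ with respect to s. The bracketing interval [2, 6] and the tolerance in the sketch below are arbitrary choices.

```python
from math import log


def optimum_condition(s):
    """Zero exactly when s + 1 = s * ln(s)."""
    return s * log(s) - (s + 1)


def bisect_root(f, lo, hi, tol=1e-9):
    """Find a root of f in [lo, hi], assuming f changes sign on the interval."""
    if f(lo) * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


s_opt = bisect_root(optimum_condition, 2.0, 6.0)
```

The root lies just below 3.6, in agreement with the value quoted in the text to two significant figures.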

In Case C when the file fits within random access memory, the
optimum path lengths are determined by the following argument.

A path leading to a block of items is considered to have its
optimum length if the expected search time for items in the block is
increased when the path length is changed. To determine the optimum
length for a path, assume a node x of the tree governs g items and
has s nodes in its filial set. Then the nodes of the filial set each
govern an average of g/s items. If the branching is discontinued at node
x, the expected search time from x is

$$t_d = \frac{1}{2}(g + 1). \tag{A15}$$

If, however, the branching is continued for one more level, the expected
search time from x is

$$t_c = \frac{1}{2}(s + 1) + \frac{1}{2}\left(\frac{g}{s} + 1\right). \tag{A16}$$

Thus, the expected search time from x will decrease if the branching is
continued when $t_d > t_c$.

$t_d > t_c$ when

$$s^2 + (1 - g)s + g < 0. \tag{A17}$$


The zeros of (A17) are at

$$\sigma_1, \sigma_2 = \frac{g - 1}{2} \pm \frac{\sqrt{g^2 - 6g + 1}}{2}. \tag{A18}$$

Thus if $g^2 - 6g + 1 > 0$, there is a range of (real) s for which adding one
more level does decrease the expected search time. Conversely, if
$g^2 - 6g + 1 < 0$, there is no such value. The zeros of $g^2 - 6g + 1$ are
at $3 \pm 2\sqrt{2}$.

Therefore, the expected search time will be decreased if
branching is continued from any node which has

$$g \ge 6 \tag{A19}$$

and

$$\sigma_1 < s < \sigma_2. \tag{A20}$$

Figure 15 displays (A17). For any node with its (g, s) lying
in the pie-shaped region to the right of the solid curve of Fig. 15,
(A19) and (A20) are satisfied; that is, the search time can be
decreased by continuing the branching from that node. The dashed
curves indicate the relative improvement in the search time between the
cases of continuing and discontinuing the branching from the node.
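The one-level-at-a-time test derived above can be written directly from the two expected search times. The sketch compares $t_d = (g + 1)/2$ with $t_c = (s + 1)/2 + (g/s + 1)/2$ and also returns the range of s between the zeros of $s^2 + (1 - g)s + g$; the function names are illustrative.

```python
from math import sqrt


def time_if_stopped(g):
    """Expected search time when branching stops at a node governing g items."""
    return 0.5 * (g + 1)


def time_if_continued(g, s):
    """Expected search time when one more level with s nodes is added."""
    return 0.5 * (s + 1) + 0.5 * (g / s + 1)


def should_branch(g, s):
    """True when continuing the branching strictly lowers the search time."""
    return time_if_stopped(g) > time_if_continued(g, s)


def branching_range(g):
    """Open interval of s for which branching helps, or None if there is none."""
    discriminant = g * g - 6 * g + 1
    if discriminant <= 0:
        return None
    root = sqrt(discriminant)
    return ((g - 1 - root) / 2, (g - 1 + root) / 2)
```

For g = 6 the interval is (2, 3), so no whole-number filial set size lowers the search time; for g = 7 a filial set of 3 nodes already does. A node with a single successor (s = 1) never benefits from one added level, whatever g is.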

Example: See Figs. 6, 7, and 16. Conditions (A19) and (A20)
are certainly satisfied for all nodes on the first level of the tree
for Cassell's Dictionary. They are also satisfied for all the nodes on
the second level of the "w" subtree except the "wy" branch; the "wy"
node should not branch but indicate the location of the list "wychelm,"
"wyvern." Similarly (A19) and (A20) are satisfied for nodes "wre" and
"wri," but not for the other nodes of the sib set of "wr": node "wra"
satisfies (A19) but not (A20), the others do not satisfy (A20).

Notice that the argument leading to (A19) and (A20) involves
adding one level at a time. In rare cases it might be advisable to add
a level which increases the expected search time if subsequently another
level is added which decreases the over-all time.

Example: The first level node "q" of the tree of English words
has s = 1. Hence adding the second level, to form "qu," increases the
search time. However, adding a third level (s = 4: "qua," "que," "qui,"
and "quo") substantially decreases the search time.

IV. AN INTERPRETIVE PROGRAM FOR CORRELATING LOGICAL MATRICES

Michael Lesk


Several experiments in automatic content analysis have been 
conducted in which properties of a document collection were conveniently 
expressed by logical matrices. For example, a subject index may be 
represented by a matrix in which each row corresponds to an index term, 
and each column to a document. The documents associated with each term 
are then indicated by l's in the proper columns. Bibliographic citations 
may also be represented in this form; specifically, logical matrices were
used recently to compute an index of relations between citations and
document content. The present report describes a computer program
designed to manipulate logical matrices in accordance with the
specifications outlined in ref. 1. The following operations are needed: matrix
transposition, Boolean multiplication, and correlation of logical matrices.
Two types of correlations may be produced: row correlations, which
compare rows of the same matrix, and cross correlations, which compare
two different matrices.

The program is written for the IBM 7090 computer, which has binary
logic instructions permitting efficient storage and processing of matrices.
The number and size of the logical matrices to be processed is limited
only by the space available in 7090 core (there is a normal limit of 25
matrices, but this can be raised easily as described in the appendix).
Numerical matrices of row correlations may not be placed in core storage,
but are written on intermediate tapes. Logical matrices are referred to
at all times by their names, consisting of any five BCD characters (except
five blanks). Each matrix is assigned a unique name when it is read into
storage or generated by the program.

The input cards to the program consist of instructions and (where
needed) descriptions of matrices. The instruction cards include a
pseudo-operation code beginning in card column 1, extending as far as necessary,
and terminating with a blank column. The names of any matrices or tapes to
be used in the instructions are written following this blank column.
Operation codes recognized by the program include the following:

ASYMLOGIN NAME1 - Asymmetric logical matrix input. This instruction
causes the program to generate a matrix, named NAME1, as specified by the
cards following the operation card. "Asymmetric" implies that the rows
and columns are not referred to by the same names, and has nothing to do
with the symmetry properties of the matrix elements. Immediately after the
ASYMLOGIN card a list of the names of every row of the matrix, called the
"row identifiers," follows. Each name consists of five characters (not all
blank) and is followed by a blank. This permits twelve identifiers to be
punched in one card, columns 1-72. As many cards as needed are used, the
end of the list being signaled by six consecutive blank columns (on a new
card, if need be). The next set of cards after this sentinel contains the
"column identifier" list in precisely the same format (with the same ending
sentinel), but specifying the names of the columns rather than the rows.

The cards which describe the actual matrix follow the column identifier list.
Each element of the matrix is specified by its row and column identifiers.
The cards describing the matrix have blanks in columns 1-6, a row identifier
in columns 7-11, and a list of column identifiers (with a single blank
column after each identifier) in columns 13-72. The program assigns a
value of 1 to the matrix element specified by the row identifier and each
column identifier on that card. Up to ten elements may be specified on
one card. If a row has less than eleven 1's in it, it is described by a
single card with the row identifier in columns 7-11, followed by the
column identifier for each column in that row at which there is an element
with the value one. A row with more than ten 1's in it will require more
cards to specify it; a row identifier may occur on any number of cards.
If a row has no 1's at all, no card is needed for that row. The order of
the column identifiers on a card, or of the cards that specify the matrix,
is immaterial, but all matrix specification cards must follow all cards
giving the identifier lists. The end of the specification cards is marked
by a card containing something other than six blanks in the first six
columns (this card will be the next pseudo-operation). A card in the
matrix specification section with a row identifier not given in the list
of row identifiers which preceded the matrix specification will cause an
error notation on the output copy. Aside from this notation, such a card
will be ignored. If a card contains a column identifier that was not given
in the column identifier lists, an error notation will be made and that
identifier ignored. The program will try to interpret the remainder of the
card, however. Within the lists of row and column identifiers, each
identifier must be given only once. If any identifier is repeated in an
identifier list, the program will detect the second occurrence, make an
error notation, and ignore the repetition.
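The matrix specification card layout can be sketched as a small parser. This is a modern illustration of the card format described above, not the 7090 program itself: cards are modeled as strings, the function names are hypothetical, and the error messages are invented placeholders for the "error notation" the text mentions.

```python
def parse_matrix_card(card):
    """Return (row_id, column_ids), or None if columns 1-6 are not blank."""
    card = card.ljust(72)[:72]
    if card[:6].strip():
        return None                          # next pseudo-operation card
    row_id = card[6:11]                      # card columns 7-11
    column_ids = []
    for start in range(12, 72, 6):           # card columns 13-72, ten slots
        identifier = card[start:start + 5]
        if identifier.strip():
            column_ids.append(identifier)
    return row_id, column_ids


def build_matrix(row_ids, column_ids, cards):
    """Collect the (row, column) positions holding 1, plus error notations."""
    ones, errors = set(), []
    for card in cards:
        parsed = parse_matrix_card(card)
        if parsed is None:
            break                            # end of the specification cards
        row, columns = parsed
        if row not in row_ids:
            errors.append("unknown row identifier: " + row)
            continue                         # the whole card is ignored
        for column in columns:
            if column not in column_ids:
                errors.append("unknown column identifier: " + column)
                continue                     # the rest of the card is kept
            ones.add((row, column))
    return ones, errors


rows = ["TERM1", "TERM2"]
columns = ["DOC01", "DOC02", "DOC03"]
cards = [
    "      TERM1 DOC01 DOC03",
    "      TERM2 DOC02 DOC09",
    "      TERM9 DOC01",
    "MULT  A B C",
    "      TERM1 DOC02",
]
ones, errors = build_matrix(rows, columns, cards)
```

In this run the unknown row TERM9 and the unknown column DOC09 each produce one error notation, and the card after the MULT card is never read as matrix data.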

SYMLOGIN NAME1 - Symmetric logical matrix input. This pseudo-operation
reads in a matrix whose rows and columns have the same names.
Only one identifier list, which serves for both rows and columns, is given;
otherwise, the operation is identical with ASYMLOGIN.

LOGWRITE NAME1 T - Write logical matrix on tape. There are two
intermediate tapes, denoted P and Q. To write a logical matrix on one of them,
this instruction is used with either the character P or the character Q
replacing T in the instruction. NAME1 is written on the tape as a two-record
file.

LOGREAD T NAME1 - Read logical matrix from tape. A matrix written
by a LOGWRITE instruction is read from tape (again, T must be either P or
Q), and named NAME1. If a matrix written by a LOGWRITE is not in position
on the specified tape, there will be an error printout, and the operation
will not be executed. NAME1 may be omitted from the instruction; if no name
is given, the matrix will retain the name it had when it was written on the
tape.


FREE NAME1 NAME2 NAME3 ... - Free storage areas used by NAME1, etc.
(operation terminates when six blank columns are encountered on the instruction
card). This card tells the program that the storage being occupied by NAME1,
etc. is now available for use by other matrices. No further references may
be made to NAME1, etc. An alternative method of freeing a matrix is to
insert an F after its name in any pseudo-operation that refers to the
matrix. This causes the matrix to be freed after the pseudo-operation
is performed. Thus, LOGWRITE NAME1F P is equivalent to LOGWRITE NAME1 P
followed by FREE NAME1; either will cause NAME1 to be written on tape P,
and then freed so that the core storage it occupied becomes available for
further operations. To decide when matrices must be freed, the following
formula is used: let R be the number of rows of a matrix, C the number
of columns, and let N equal C/36 (add one if any remainder). Then the
number of words of core occupied by the matrix is 6 + C + R(N + 1).
CR/36 + C + R is a good approximation. Whenever the total space used
by matrices would exceed 29,000 locations, matrices must be freed or an
error printout will result.
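The storage formula lends itself to a quick planning aid. The sketch below evaluates 6 + C + R(N + 1) with N taken as C/36 rounded up, and checks a set of matrices against the 29,000-location limit; the function names and the sample dimensions are illustrative.

```python
CORE_LIMIT = 29_000   # locations available to matrices, per the text


def words_occupied(rows, columns):
    """Exact core usage: 6 + C + R * (N + 1), with N = ceil(C / 36)."""
    words_per_row = -(-columns // 36)        # integer ceiling of C / 36
    return 6 + columns + rows * (words_per_row + 1)


def approximate_words(rows, columns):
    """The approximation C * R / 36 + C + R quoted in the text."""
    return columns * rows / 36 + columns + rows


def must_free(shapes):
    """True if matrices with the given (rows, columns) shapes exceed the limit."""
    return sum(words_occupied(r, c) for r, c in shapes) > CORE_LIMIT


exact = words_occupied(100, 72)       # N = 2, so 6 + 72 + 100 * 3
rough = approximate_words(100, 72)    # 200 + 72 + 100
```

A single 500-row by 1800-column matrix fits within the limit under this formula, but two of them do not, so one would have to be freed or written to tape first.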

TRANSPOSE NAME1 NAME2 - Transpose logical matrix. NAME1 is
transposed and the result called NAME2. Note that during the execution of this
instruction, the storage space required is equal to twice that for NAME1
plus the storage for NAME2. If the instruction is of the form TRANSPOSE
NAME1F NAME2, however, the space needed is only that for NAME1 and NAME2.

MULT NAME1 NAME2 NAME3 - Multiply logical matrices (Boolean
multiply). Matrix NAME3 is defined as follows: its row identifier list
is the row identifier list of NAME1, its column identifier list is the
column identifier list of NAME2, and its elements are defined by normal
multiplication of matrices, with "multiply" replaced by "logical and" and
"add" replaced by "logical or" throughout the definition of matrix
multiplication. That is, if we call A the matrix named NAME1, B the
matrix NAME2, and C the product matrix NAME3, then $c^i_j = 1$ if and only if
there exists a k such that $a^i_k$ and $b^k_j$ are both 1. The column identifier
list of NAME1 must be the same as the row identifier list of NAME2 for this
pseudo-operation.

MULTI NAME1 NAME2 NAME3 - Boolean matrix multiplication by transpose. 
NAME3 is defined to have the row identifier list of NAME1 and a column 
identifier list which is the same as the row identifier list of NAME2; 
element $c_{ij} = 1$ if and only if there is a k such that $a_{ik}$ and 
$b_{jk}$ are both 1. Thus NAME3 is the product of NAME1 by the transpose 
of NAME2. It should be noted that MULT is performed by executing a 
TRANSPOSE followed by a MULTI, so that MULTI is faster than MULT. 
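
The relation between the two multiply operations can be sketched in Python
as follows. Matrices are assumed here to be lists of 0/1 rows, and the
function names are hypothetical; the 7090 program packs 36 elements to a
word and is not reproduced by this sketch.

```python
def transpose(m):
    """Rows become columns."""
    return [list(col) for col in zip(*m)]


def mult_by_transpose(a, b):
    """c[i][j] = 1 iff some k has a[i][k] = 1 and b[j][k] = 1."""
    return [[int(any(x and y for x, y in zip(row_a, row_b)))
             for row_b in b]
            for row_a in a]


def bool_mult(a, b):
    """Boolean product of a and b, done as a transpose followed by a
    multiply-by-transpose, as the text describes for MULT."""
    if a and len(a[0]) != len(b):
        raise ValueError("columns of a must match rows of b")
    return mult_by_transpose(a, transpose(b))
```

Because bool_mult pays for an extra transpose, calling mult_by_transpose
directly is the cheaper route whenever the second operand is already
available in transposed form.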

ROWTAPR NAME1 T - Row correlate NAME1, write the results on the 
output tape and on T. Row correlations are comparisons between rows of a 
matrix. The correlation factor, r, between rows i and j of a matrix I, 
is defined 



where matrix I has m rows. ROWTAPR computes the correlation for each pair 
of rows in NAME1 (unless a SELECT pseudo-operation has been executed; see 
that operation), and writes these correlations on the output tape and the 
tape specified by T (either P or Q). The output appears in a triple column 
format with 56 lines per page. Each output item consists of two row 
identifiers and the correlation factor; thus: 

          ROW13 ROW14 .4937 

The number of 1's in each row of the matrix is also written on the output 
tape and is called the row sum. 

ROWTAPRSUPP NAME1 T - Row correlate, write on intermediate tape. 
This operation is the same as ROWTAPR except that the row correlations 
are not written on the output tape. The row sums are written on the 
output tape as before, however. 

ROWPRINT NAME1 - Row correlate, print (off-line). Same operation 
as ROWTAPR, except that nothing is written on an intermediate tape. 
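
The defining formula for the row correlation is not legible in this copy.
The sketch below therefore ASSUMES a cosine coefficient on 0/1 rows (shared
1's divided by the square root of the product of the two row sums). That
assumption is consistent with values tabulated later in this report, such as
.7071, .5774 and .4082, but it is not confirmed by the text. Function names
are hypothetical.

```python
import math


def row_sums(m):
    """Number of 1's in each row (the 'row sum' written on the output tape)."""
    return [sum(row) for row in m]


def row_correlation(m, i, j):
    """ASSUMED cosine coefficient between 0/1 rows i and j of m."""
    shared = sum(1 for x, y in zip(m[i], m[j]) if x and y)
    denom = math.sqrt(sum(m[i]) * sum(m[j]))
    return shared / denom if denom else 0.0


def all_row_correlations(m):
    """One (i, j, r) triple for each pair of distinct rows, i < j."""
    return [(i, j, row_correlation(m, i, j))
            for i in range(len(m)) for j in range(i + 1, len(m))]
```

Under this assumption, a row with two 1's and a row with one 1 that share a
single column correlate at 1/sqrt(2).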

LOGCROSS NAME1 NAME2 - Cross correlate NAME1, NAME2. Cross 
correlations are a measure of similarity between sets of row correlations 
derived from two distinct logical matrices. For every row in a logical 
matrix we can form a set of row correlations of the row; this set is a 
vector $R^i$, where the elements $R^i_j$ are the row correlations of row i 
with row j. If two matrices have the same row identifier lists, we may 
define a corresponding $S^i$ for the same row in the other matrix. Now, the 
cross correlation between row i in these two matrices is defined as 

LOGCROSS computes these cross correlations from the logical matrices by 
first computing the row correlations for a row in each matrix, and then the 
cross correlation for this row. The row correlations, although computed as 
intermediate data, are not available as output. An "over-all" cross 
correlation is also produced; this is the quotient 

• 

LOGCROSS writes on the output tape the row sums of each matrix, the cross 
correlation for each row, and the over-all cross correlation. 
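
Since the formulas for the cross correlation and for the over-all quotient
are not legible in this copy, the following Python sketch is illustrative
only. It ASSUMES that both the row correlation and the per-row cross
correlation are cosine coefficients, and that a row's correlation with
itself is left out of its vector. None of this is confirmed by the text,
and the over-all quotient is deliberately not attempted.

```python
import math


def cosine(u, v):
    """Cosine coefficient of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return dot / norm if norm else 0.0


def row_corr_vector(m, i):
    """ASSUMED row correlations of row i with every other row of m."""
    return [cosine(m[i], m[j]) for j in range(len(m)) if j != i]


def cross_correlation(a, b, i):
    """ASSUMED cross correlation of row i between matrices a and b,
    which must list the same row identifiers in the same order."""
    if len(a) != len(b):
        raise ValueError("matrices must have the same row identifiers")
    return cosine(row_corr_vector(a, i), row_corr_vector(b, i))
```

With these assumptions a matrix cross correlates perfectly with itself, and
a row whose correlation vector is all zero in either matrix yields zero.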

TAPECROSS - Cross correlate two matrices on tape. If there is a 
numerical matrix of row correlations on tape P, and another on tape Q 
(written by ROWTAPR and ROWTAPRSUPP operations), and both matrices have the 
same row identifiers, TAPECROSS will compute the cross correlations for each 
row and the over-all cross correlations, and write them on the output tape. 

If tapes P and Q do not both contain numerical matrices with the same row 
identifiers, error printouts result. 

SELECT NAME1 - Selects row identifiers. If two matrices have different 
row identifier lists, they cannot be processed directly by LOGCROSS and 
TAPECROSS. If portions of the identifier lists coincide, however, and it is 
wished to correlate some or all of the common rows, the SELECT pseudo¬ 
operation may be used. It must be followed by an identifier list in ASYMLOGIN 
format. The program will flag all row identifiers in NAME1 not given in 
the list supplied after the SELECT card, and all such flagged identifiers 
will be ignored during computation of row correlations. Thus if the 
unflagged portions of the identifier lists of two matrices are the same, 
they may be cross correlated, although the flagged portions of the 
identifier lists are different. 

BACKSPACE m T1 n T2 - Backspace tape. Tape T1 (either P or Q) is 
moved backwards m files (one logical matrix or one set of row correlations 
is one file), and tape T2 is moved backwards n files. If either m or n 
is blank, it is interpreted as 1. If T2 is blank, only one tape is moved. 
If T2 = T1, the second specification overrides the first. 

DATE iilililliiillli - The fifteen characters following DATE are 
copied onto the page heading. 

END - Stops the run. This should be the last card of any deck of 
instructions. If it is inadvertently omitted, the program will stop when 
an end-of-file appears on the input tape. 

As assembled, this program will run under the FORTRAN monitor 
system on any 7090 with 32K storage and sufficient tape units (two channels, 
five tapes on each). The program reads BCD unblocked records from A2 and 
writes BCD unblocked records on A3 as output. The intermediate tapes used 
are A5 (P) and B5 (Q). Output is to be printed PC; nothing is printed past 
position 72 on the page. 

In a standard FMS system, the following cards are needed: 

(1) job ID card for FMS sign-on record; check individual 
system requirements; 

(2) *XEQ (* in column 1; X, E, Q in columns 7-9); 

(3) binary program deck, in normal relocatable column binary 
format; 

(4) *DATA (* in column 1; D, A, T, A in columns 7-10); 

(5) instructions and other data, terminated by an END card. 
Scratch tapes must be mounted on tapes P and Q (A5, B5). 

(Note: To date it has never been necessary to use the pseudo-operations 
TAPECROSS and SELECT. As a result, these two instructions cannot be 
guaranteed to operate correctly.) 


APPENDIX A 
PROGRAM PARAMETERS 


Some program parameters may be changed without excessive difficulty. 
Four particularly important cards in the FAP deck are (all near the beginning): 


     Cols. 1-6      Cols. 8-15      Cols. 16 

     LNADTB         EQU             50 
     MAXST          EQU             29000 
     P              TAPENO          A5B 
     Q              TAPENO          B5B 

LNADTB is twice the absolute maximum number of matrices that may 
be kept in 7090 core; it is presently set for 25 matrices. MAXST is the 
main storage area, and determines the size of the storage area for matrices 
and other temporary storage (such as I-O buffers; none of these other 
storage areas is likely to be large). At the moment, the distribution 
of core storage is 

     Inaccessible lower and upper core areas:     306 locations 
     Main program, lower storage:               3,178 locations 
     Subroutine EXIT, lower storage:               18 locations 
     Subroutine SORT, lower storage:               44 locations 
     Address table (LNADTB), lower storage:        50 locations 
     Main storage area, upper storage:         29,000 locations 
     Unoccupied storage:                          172 locations 

     Total 7090 core storage:                  32,768 


In changing the input and output tapes, it must be remembered that 
tape reading and writing is not done through (IOU) and reassembly will be 
required if the system configuration is changed from the version in the 
FORTRAN manual. The input tape is defined by a card R TAPENO A2; the 
output tape is defined when FD90UT is called to write a line of output 
((STH) is not used). To change the output tape, use the symbolic reference 
table in the assembly listing to find calls to OUT and OUTNC, and the 
SHARE writeup of FD90UT (with the comments in the listing) to determine 
how to alter the calling sequence to change the output tape. 


APPENDIX B 
SAMPLE PROGRAM 

The sample program investigates the relation between citations and 
content as outlined in ref. 1. It uses two citation correlation matrices, 
CITNG and CNG 2, and computes their cross correlations with the term 
correlation matrix TDCMP. There are two input matrices, CITED and TTCMP. 
CITED represents a citation index; if we call this matrix A, then 
$A_{ij} = 1$ if document j cites document i. TTCMP represents a subject 
index; calling TTCMP B, $B_{ij} = 1$ if term i applies to document j. The 
program generates from these matrices the needed matrices CITNG, CNG 2, 
and TDCMP. It then computes the row correlations for CNG 2 and the cross 
correlations for both CITNG and CNG 2 with TDCMP. 

The program, complete with all descriptions of matrices, is given 
in the attached listing; the instructions are reproduced here: 

DATE OCT. 18, 1962 
          This instruction causes OCT. 18, 1962 to appear on the top 
          of each page of output. 

SYMLOGIN CITED 
          Reads in CITED. The next six cards give the identifier list; 
          matrix specifications follow. Note that only one identifier 
          list is given. 

TRANSPOSE CITED CITNG 
          CITNG, the matrix of citations produced by transposing CITED, 
          represents the actual bibliographies of the documents; each 
          row corresponds to the references pertaining to one document. 

MULTI CITNG CITEDF CNG 2 
          This instruction produces CNG 2, the second order citation 
          matrix. In CNG 2, $A_{ij} = 1$ if there is a document k such 
          that i cites k and k cites j. By reference to the definition 
          of MULTI, it will be seen that CNG 2 is precisely the matrix 
          produced. CITED is no longer needed and is freed. 

ROWPRINT CNG 2 
          The row correlations of CNG 2 are computed and written on the 
          output tape. They are not stored internally or on intermediate 
          tapes and will not be available for future processing. 

ASYMLOGIN TTCMP 
          TTCMP, the subject index, is read in from the input tape. Note 
          that two identifier lists are given; there are fifty-six terms 
          and sixty-two documents, each with a distinct set of names. 

TRANSPOSE TTCMPF TDCMP 
          TTCMP is transposed to get a matrix with the same row 
          identifier list as CNG 2 and CITNG. TDCMP, the transposed 
          matrix, represents a set of descriptors; calling it A, 
          $A_{ij} = 1$ if term j applies to document i. TTCMP is freed. 

LOGCROSS CITNG TDCMP 
          Cross correlations are produced for TDCMP and CITNG. This is 
          done by computing the row correlations for each row of each 
          matrix internally, and cross correlating the row correlation 
          vectors, row by row. A cross correlation for each row and an 
          over-all cross correlation are produced. Row correlations are 
          not written on tape for printing. 

LOGCROSS CNG 2 TDCMP 
          Cross correlations are computed for TDCMP and CNG 2, as above. 

END 
          Stops the run. 
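
The matrix steps of the sample program (TRANSPOSE followed by MULTI) can be
traced on a toy collection. The Python sketch below is illustrative only:
the three-document citation matrix is invented for the example, and the
function names are hypothetical stand-ins for the pseudo-operations.

```python
def transpose(m):
    """Rows become columns."""
    return [list(col) for col in zip(*m)]


def mult_by_transpose(a, b):
    """c[i][j] = 1 iff some k has a[i][k] = 1 and b[j][k] = 1."""
    return [[int(any(x and y for x, y in zip(row_a, row_b)))
             for row_b in b]
            for row_a in a]


# Invented data: document 0 cites document 1, document 1 cites document 2.
# cited[i][j] = 1 if document j cites document i.
cited = [[0, 0, 0],
         [1, 0, 0],
         [0, 1, 0]]

citng = transpose(cited)                # citng[i][j] = 1 if i cites j
cng2 = mult_by_transpose(citng, cited)  # i cites k and k cites j
```

Here the only second order citation is from document 0 to document 2,
through document 1, so cng2 has a single 1 in row 0, column 2.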







CO403 

VC307 

F0409 

J0413 


DATE OCT. 16. 196? 
SYMLOGIN C1Tf0 
B0319 80407 BO408 CH504 
F0306 F0603 FR20* FR303 
G1491 Gf 495 GR202 GR204 
J0413 JO607 JO609 J061I 
HJ306 MJ310 MJ494 MJSO* < 
WA410 MA412 

B0319 C0317 F04li> 
B0407 GB604 KU60*> 
00406 BO407 PL60I 
CH504 

C0206 00319 C0317 
C0303 VS306 
C0317 FR406 FR61? 
. .£04O2-__..... • 

CO403 JO609 LV602 
F0409 IS4U J0413 
F0415 IS4U J0609 
F0496 JO609 
F0506 F0419 J0609 
F0603 PL60S 
FR20S 00119 CO403 
FR303 C0317 VC307 
FR304 C0401 F0409 
FR309 J0310 J0413 
FR40S CO401 
FR406 CO403 
FR612 
GB604 

GE302 C0317 J03J6 
G1209 B0319 J0314 
GI 491 CHS04 F0496 
61495 F0496 JO201 
GR202 PL313 
GR204 

GR313 C0317 F0409 

I S311 JO609 

I S312 FR304 J0316 

IS411 WA412 

J0201 B0319 B0407 

J0201 J0314 J0316 

J0310 C0317 FR309 

J0314 J0611 MA301 

J0316 CO403 IS411 

J0413 J0607 

J0607 

J0609 

JOA 1 1 

K11603 

LF606 

LY414 F0603 J0609 
LY602 F0603 J0609 
; MA301 J0314 SA401 
MG404 CO403 J0609 
MG493 F0496 GI49S 
MJ203 J0413 J0609 
MJ306 C0317 CO403 


C0306 C0317 C0402 C0403 F0409 F0415 F0496 
FR309 FR40S FR406 FR612 68604 GE302 GI209 
1531 1 I S312 IS4U J0201 JOJtO J0314 J0316 
LE606 LY414 LY602 MA301 MG404 MG493 MJ203 
Pi31S PL601 R0207 SA401 SH206 VC307 VS306 

PL3h VC307 


FR20S J0609 i£606 SH206 VS306 


IS311 J0310 J0607 SA401 VS306 


SA401 VS306 
J0201 J0609 HJ203 


ISA 11 WA4I0 MA412 


C0317 C0403 FR20S FR309 GR202 GR204 J0310 
J0607 J061I SA401 VS306 
GB604 HA301 PLUS 
SA401 

JO607 WA412 


JO609 HJ203 HJ494 HJS05 SH206 
SH206 

J0609 PL31S 
MJ318 J0609 C0<>03 PLUS 
MJ*9* J0609 MJ403 

MJ909 C0208 F0409 FR*0S 01209 J0310 J0609 

MJJOS VC307 VS308 

OE*92 

PLJ15 00*03 FR309 FR*05 J0310 
PL601 F0603 

R0207 00303 00*02 00317 FR303 LV602 VS308 
SA*01 

SH206 00317 00*03 FR205 FR303 FR309 OE302 

SH206 P 1.315 VC307 VS308 

VC307 00317 

VS308 C0305 FR303 

WA*10 WA*12 

WA*12 F06A3 IS*11 J0611 

IMMSPOSt CITED XITNG - - - ■ 


MULTX CI TNG CITEDF CNG 2 
ROWPRINT CNG 2 
ASVNLOGIN TTCHP 


AOJEC 

BIBU 

CIRCT 

COHPL 

CMPOR 

CMPOV 

CONTX 

CORUN 

eoitg 

DICTE 

FEEOB 

FI LOR 

FREOU 

GRAGR 

GRCOD 

HOMOG 

INPPR 

JAPAN 

MASMD 

MUCOR 

NOUNS 

NUMLS 

POEOT 

PRSAN 

RFVBS 

SCRCH 

SEGHT 

SEMAN 

SEOIA 

SELCif 

S0RT6 

STOAC 

TEXTS 

TIMNG 

TRNSL 

TRLIT 

UPDAT 

VERBS 

WOC L 5 

WOSTM 

R0319 

00407 

B0408 

CH304 

C0208 

CO303 

C031 - 

C0402 

F0306 

F0603 

FR205 

FR303 

FR 304 

FR309 

FR405 

FR40S 

GU91 

61495 

6R20? 

GR204 

GR313 

I 5311 

I S312 

IS411 

J0413 

J0607 

J0609 

J0611 

KU60S 

LES06 

LY4U 

LYS02 

NJ306 

M J316 

MJ494 

MJ505 

0E492 

PL315 

PLS01 

R0207 

WA410 

WA412 








AOJEC 

FR403 

FR612 






BlBLt 

CO402 

R0207 






CIRCT 

CH504 







COHPL 

01493 

J0S09 

MJ494 





CMPOR 

0E492 







CMPOV 

CH504 

0E492 






CONTX 

BO 319 

C0317 

FR309 

FR405 

J0314 

MJ203 


CORUN 

61491 

6R204 

J0201 

J0316 

J04I3 

JOSOT 


CRECT 

C0206 

F0413 

FR205 

FR304 

GR204 

I S3 J 2 


EXTRC 

GE302 

I S311 






LOXUP 

B0407 

J0201 

J0609 

SA401 




01 COR 

61491 

MJ494 

MJS03 





EOITG 

JOS 09 

JOSH 






OICTE 

B0407 







FEEOB 

GB604 

61209 

J0314 

MA30I 




FI LOR 

01491 







FREOU 

C0305 

FR205 

FR303 

VS308 




GRA6R 

C03I7 

FR403 

FRS1? 





GRCOO 

B0407 

C0208 

C0403 

F0409 

F0496 

F030S 


GRCOO 

J0310 

J0413 

LY4U 

LY602 

MG404 

MJ203 


GRCOO 

PL315 

PLS01 

SH20S 

WA4J0 

WA412 



HOMOG 

C0208 

C0317 

GE302 

SM206 

VC *07 



MAO 10 

C0208 

C0403 

F0413 

FR203 

FR104 

01209 


HAD! C 

JOS 09 

LE606 

MG* 04 

MJ494 

MJ509 

PL313 


INFlE 

FO409 

GR313 

(S4U 

WA410 

WA412 



inflr 

F049S 

FO306 

MG493 

MJ503 




INPEO 

GR202 







INPPR 

GR202 

GR204 






l7602 HJ203 MG*0* SH206 


15*11 J0310 J031* JOS 16 


CRECT EXTRC 
HAD 1C INFLE 
PREPO PRODS 
STRAN SUFAN 

00*03 F0A09 
FR61? GH60* 
J0201 JOS 10 
MA301 MG*0* 
SA*0l SH206 


LOKUP 01COR 
INFtR INPEO 
PROER PRONS 
STNAR 5VNAE 

K0*15 F0496 
GC 302 GI209 
J031* J0316 
MC*93 MJ203 
VC307 VS308 


VC307 

VS308 

J031* JO609 LE606 VS30S 


FR303 FR309 GR313 IS*H 
MJ306 MJ31R MJ*9* MJ904 


Gt*91 01*94 1SJ12 ISAM 
VS308 
JAPAN 

KU605 









MASHO 

SA401 









MUCOR 

J0611 




• 





NOUNS 

C0317 FR*05 

FR612 

VC307 

WA410 






NUMLS 

MG404 









poeot 

GI209 J031* 

HA301 








PRSAN 

B0407 B0409 

GB604 








PREPO 

MJ306 









PRODS 

80319 80*07 

B040B 

C0208 

C0317 

F0*09 

F0413 

F0496 

F0506 

FR205 

PRODS 

FR304 GE302 

GI 491 

G1495 

15311 

IS312 

($411 

J0201 

J0310 

JOS 16 

PRODS 

JO*13 J0607 

J0609 

LE606 

SA401 

SH206 

VS308 

WA*12 



PROER 

J0310 









PRONS 

HJ318 









RFVBS 

LY414 LY602 








.— 

r stnorisTwr raw? 









SEGMT 

KU605 SA401 









SEMAN 

LY*1* 









SEDIA 

PL601 









SElCT 

C0305 









SORTG 

FR205 GI491 

25311 

I S411 

VS308 






STOAC 

0E492 









STRAN 

F0603 PL601 









SUFAN 

CH304 61491 

GR313 

J0316 

KU605 

LY414 

HJ203 

SH206 



SYNAR 

B0319 F0409 

F0603 

FR203 

FR304 

FR406 

FR612 

15411 

J0310 

LV41* 

SYNAR 

PL601 









SYNAE 

80*07 80*06 









TEXTS 

C0*02 R0207 









TIMNG 

GR204 









TRN5L 

F0603 G860* 

GI 209 

J0314 

KU605 

LY602 





TRUT 

CHSO* GR202 

KU60S 








UPDAT 

C0208 F0413 

GI 493 

IS312 

J0609 

LE606 

PL313 




YERBS 

FR*06 FR612 









WOCIS 

F0*09 FR303 

FR309 

FR403 

FR406 

FR612 

GR313 

ICU605 

MG493 

MJ203 

WDCLS 

MJ306 MJJ18 

MJ494 

SH206 

VC307 






WDSTM 

fR309 GE302 

MJ203 








TRANSPOSE TTCMPF TOCMP 









LOOCROSS Cl TNG TOCMP 









LOGCROSS CN6 2 TOCMP 









END 


V. A COMPARISON OF CITATION DATA FOR OPEN AND CLOSED 
DOCUMENT COLLECTIONS 

Michael Lesk 

In a study by Salton of the relation between bibliographic 
citations and document content, the set of citing documents was assumed 
to be identical with the set of cited documents; that is, citations to 
and from documents outside the given document collection were disregarded. 
It was suggested, however, that the same methods might be used with open 
collections in which all citations would be allowed. This section de¬ 
scribes an experiment in which the same computations performed previously 
were repeated with an extended citation network including all citations to 
and from the collection, regardless of source. The code names previously 
used for matrices, index terms, and documents have been retained unchanged 
in the present description. 

Three of the matrices were processed on the IBM 7090 computer in 
both the open and closed forms. These are the matrices CITED, CITNG, and 
CNG 2. In each case the larger number of documents leads to lower row 
correlations in the open collection, but the relationship to content is 
not seriously impaired. 

The change from CITED closed to CITED open resulted in a relatively 
small change, because a comparatively small amount of additional data was 
introduced. Although the number of citations rose from 165 to 295, only 
21 new documents were introduced in the open collection. Furthermore, 
most of the new citations were from five works, which had long, exhaustive 
bibliographies. 

The over-all cross correlation between CITED and TDCMP dropped 
from .405 in the closed collection to .389 in the open collection. Of the 
five documents with the highest cross correlations, three were common to 
both collections, as shown in the following table: 


          CITED Closed                     CITED Open 

     Document     Cross               Document     Cross 
                  Correlation                      Correlation 

     MA301        .753                MJ505        .689 
     MJ505        .735                SH206        .682 
     MJ306        .651                MJ203        .680 
     C0208        .633                             .645 
     SH206        .612                MJ306        .610 


The row correlation data are similar in both collections, but the 
coefficients are somewhat smaller in the open collection. This is, of 
course, not true for the documents in report NSF-6, the most recent report 
in the collection, where the closed collection row correlations are en¬ 
tirely zero because of the total lack of citation data. The open col¬ 
lection does not remedy this defect entirely, however; newer documents 
still have fewer than the average number of citations, and some articles 
(such as the sole article on analysis of Japanese) have no citations at all. 
Only one document in report NSF-6 has a cross correlation with TDCMP which 
is as large as the over-all cross correlation. The lack of data for the 
last report can be attributed to the limited circulation of the reports, 
preventing most non-Harvard writers from citing them, and to the existence 
of only one additional report in the NSF series, whose authors do have 
access to the documents in the collection. Document collections taken from 
more accessible publications and printed earlier than the collection 
studied would probably not show a lack of data for the more recent docu¬ 
ments. 

The extent of the similarity between the open and closed col¬ 
lections in the CITED matrix may be demonstrated by considering the de¬ 
tailed row correlations of a typical document, C0208. The row corre¬ 
lation vectors for document C0208 are shown in Table 1. The cross 
correlation for this document in the closed collection was .633; in the 
open collection, .527. Almost all the row correlations are lower in the 
open collection, although the arithmetic difference between the corre¬ 
lations in the open and closed collections varies from document to 
document. Several articles, in fact, do not follow the normal trend, 
having higher row correlations with document C0208 in the open collection 
than in the closed collection. These are documents CH504, GI209, GI491, 
J0201, J0310, MJ494, SH206, and GI495, which are indicated in Table 1 by 
an asterisk. CH504, whose row correlation rose from .0000 (closed col¬ 
lection) to .3333 (open collection), is unimportant; the rise is due to 
a single citation that appeared in the open collection. The content of 
CH504 is completely unlike that of any other article, and presumably a 
sophisticated program evaluating citation data would recognize that the 
high row correlation was caused by only one citation, and would ignore 
CH504 in studying the content of C0208. 

The other articles that exhibited larger row correlations in the 
open collection were furnished with a substantial number of citations. 
It may therefore be of interest to study the relation of their content 
to that of C0208. C0208 deals with dictionary correction and updating; 
thus, GI209 (postediting, translation algorithms), J0310 (recognition of 
phrases within sentences), and SH206 (word-by-word syntactic analysis) 
are articles that we should expect to have low row correlations. J0201 
(continuous dictionary run) should also exhibit a low row correlation, 
while GI491, GI495, and MJ494 (dictionary compilation) might be considered 
partial successes for the open collection, since they exhibited larger 
row correlations in that collection. Unfortunately, such directly relevant 
articles as F0415 had lower row correlations in the open collection than 
in the closed collection. Thus, as far as C0208 is concerned, the closed 
citation data give slightly better results, as indicated by the respective 
cross correlations. 


     Document     Closed      Open 

     B0319        .3162       .2981 
     CH504        .0000       .3333* 
     C0305        .3536       .3333 
     C0403        .2500       .1667 
     F0415        .4082       .3849 
     F0496        .3536       .2722 
     F0506        .4082       .4082 
     FR205        .2500       .2222 
     FR303        .4082       .3333 
     FR304        .2041       .1667 
     FR309        .1768       .1667 
     GE302        .4082       .3849 
     GI209        .3162       .3536* 
     GI491        .1443       .1849* 
     GI495        .2041       .2357* 
     GR313        .1443       .1179 
     IS311        .3536       .3333 
     IS312        .2041       .1667 
     J0201        .3430       .3727* 
     J0310        .1443       .1491* 
     LY414        .2041       .1925 
     LY602        .2500       .2357 
     MG404        .2500       .1361 
     MG493        .2500       .1721 
     MJ203        .3536       .2520 
     MJ306        .3162       .2222 
     MJ318        .2041       .1361 
     MJ494        .2500       .2520* 
     MJ505        .3062       .2910 
     R0207        .2887       .1491 
     SH206        .2942       .3234* 
     VC307        .3536       .3333 

Row Correlations of C0208 
(All row correlations not listed are .0000) 

TABLE 1 

Greater differences between open and closed collection data are 
shown by matrix CITNG. Here the total number of citations more than 
doubled (from 165 to 393) and the number of documents went up corre¬ 
spondingly (from 62 to 175) in going from the closed collection to the 
open collection. The over-all cross correlation fell from .366 (closed 
collection) to .239 (open collection). There is no regular decrease in 
either the row or cross correlations, which increase in the open col¬ 
lection for some documents and decrease for others. The five highest 
cross correlations are as follows: 


          CITNG Closed                     CITNG Open 

     Document     Cross               Document     Cross 
                  Correlation                      Correlation 

     C0317        .629                B0319        .614 
     J0413        .629                IS411        .601 
     VS308        .623                J0609        .526 
     J0609        .620                J0314        .411 
     IS411        .606                F0603        .408 


Only two documents are common to both lists, and the cross corre¬ 
lations of the open collection are distinctly lower. The row correlations 
also change more than in CITED. B0408 has a .080 drop in cross corre¬ 
lation from the closed to open collections (.441 to .361). Its row 
correlation vectors are reproduced in Table 2. No uniform pattern can be 
found. The open collection does not give the impression, as it did for 
CITED, of being a scaled-down version of the closed collection. In the 
values of its coefficients, the open collection has a definite advantage 
in the low row correlations of GR202 and GR204 (.2887 open, 1.0000 closed), 
which are totally unrelated in content to B0408. The higher row correla¬ 
tion of GB604 in the open collection is also desirable, as both GB604 and 
B0408 deal with the same topic (predictive syntactic analysis). A distinct 
disadvantage of the data from the open collection is the introduction of 
many small row correlations with B0408 for documents which previously had 
a zero correlation, and are not closely related to B0408 (e.g., LY602, 
C0403). Such correlations lower the cross correlation because they do not 
correspond to high row correlations in TDCMP, and also interfere with 
attempts to derive index terms because of their lack of relation to the 
content of the articles. 


     Document     Closed      Open 

     B0319        .5000       .2041 
     B0407        .7071       .5000 
     C0317        .3015       .1491 
     C0403        .0000       .2801 
     F0603        .0000       .1925 
     FR205        .5774       .1925 
     FR309        .5000       .1925 
     GB604        .0000       .4364 
     GR202       1.0000       .2887 
     GR204       1.0000       .2887 
     IS411        .0000       .1925 
     J0310        .4082       .1925 
     J0314        .5000       .2582 
     J0316        .5000       .2887 
     J0413        .4472       .4364 
     J0607        .5000       .4714 
     J0609        .0000       .1179 
     J0611        .5774       .2582 
     KU605        .0000       .3482 
     PL601        .0000       .2041 
     SA401        .4472       .1925 
     LY602        .0000       .1667 
     VS308        .2887       .1325 
     WA412        .0000       .2357 

Row Correlations for B0408 CITNG 
(All correlations not listed are .0000) 

TABLE 2 

The open CITNG matrix was also used for attempts to assign index 
terms to documents from the citation data. The results were not, unfor¬ 
tunately, better than those provided by the closed matrix. The introduc¬ 
tion of a more sophisticated index term assigning procedure also failed 
to raise accuracy above fifty percent. It should be noted, however, that 
the cross correlations are lower in the open collection than in the closed 
collection. Results of the attempt to assign index terms are shown in 
Table 3. The new method for deriving index terms is: 

(1) Select the five highest row correlations for any given 
document. To each of the related documents producing 
these high row correlations, assign a weight equal to 
the row correlation with the given document divided by 
the number of index terms associated with the related 
document. For example, a document with a row correla¬ 
tion of .7071 and with five assigned index terms would 
be weighted .1414 = .7071/5. 

(2) For each index term, compute the sum of the weights of 
each document of the above five that it is associated 
with in TDCMP, and associate this weight with the term. 

(3) Select those index terms which have a weight greater 
than the weight of any single document. For example, 
suppose documents A, B, C, D, and E had weights 
(respectively) of .1, .15, .15, .08, and .2. A term 
assigned to documents A and D would not be selected 
(.1 + .08 = .18 is less than .2), while a term assigned 
to documents A and C would be selected (.15 + .1 = .25 
is greater than .2). 


     (1)        (2)              (3)               (4)            (5) 

   Document   Index Terms      Number of Terms   Total Number   Cross 
              Assigned         in (2) Correct    of Terms       Correlation 
              Automatically    (also assigned    Assigned 
              from CITNG       manually)         Manually 

   FR304          5                1                 4          .3426 
   J0607          3                1                 2          .2101 
   F0409          4                3                 5          .3006 
   C0208          6                3                 6          .3711 
   J0413          3                3                 3          .2777 

   TOTAL         21               11                20 

Index Terms from CITNG Open 
TABLE 3 
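
Steps (1) to (3) above can be sketched in Python as follows. This is an
illustrative reading of the procedure, not the program actually used; the
function names and the dictionary layout (related document to row
correlation, document to set of TDCMP terms) are assumptions made for the
example.

```python
def document_weights(row_corr, doc_terms, top_n=5):
    """Step (1): weight each of the top_n related documents by its row
    correlation divided by its number of assigned index terms."""
    top = sorted(row_corr.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {doc: r / len(doc_terms[doc])
            for doc, r in top if doc_terms.get(doc)}


def select_terms(weights, doc_terms):
    """Steps (2) and (3): sum document weights per term, then keep the
    terms whose total exceeds the largest single document weight."""
    term_weight = {}
    for doc, w in weights.items():
        for term in doc_terms[doc]:
            term_weight[term] = term_weight.get(term, 0.0) + w
    threshold = max(weights.values(), default=0.0)
    return sorted(t for t, w in term_weight.items() if w > threshold)
```

With document weights of .1, .15, .15, .08 and .2 for A to E, a term shared
by A and C totals .25 and is selected, while a term shared by A and D
totals .18 and is not, which matches the worked example in step (3).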

The matrix CNG 2 has far more citations and far more documents than 
either CITNG or CITED. Cross correlations are much lower in the open col¬ 
lection than in the closed collection; the second highest cross correlation 
is only .201 in the open collection as opposed to .747 in the closed 
collection. The over-all cross correlation drops from .482 to .116. The 
changes in the row correlation vectors are not consistent with respect to 
content, but many new low row correlations are introduced, making the 
study of the correlations more difficult and adversely affecting the 
determination of index terms. Irrelevancies may be introduced; the open 
CNG 2 collection contains citations to Mark Twain's "The Jumping Frog of 
Calaveras County" and Lewis Carroll's essay on assigning prizes in tennis 
tournaments. 

In summary, then, the open collection of documents is inferior to 
the closed collection in terms of content-citation relationships. This 
is probably due to the greater number of extraneous citations which 
obscure the relevant data. This effect is strongly evident in CNG 2. 
In other matrices, the inferiority of the open collection is not as 
strongly marked, and it is possible that it might be of more use in 
analyzing citation data. As was pointed out under CITNG, there is some 
improvement in the relation of the high row correlations to content, 
while the small row correlations are less indicative of content than they 
were in the closed collection. Thus a scheme of analysis concentrating 
on the large correlations might give better results using the open 
collection, but this seems unlikely. Work done to date confirms the 
impression of the cross correlations, that is, the superiority of the 
closed collection. A conclusion as to the relative merits of CITED 
and CITNG cannot be drawn from these data, because the small circulation 
of the document collection has prevented adequate citation of its 
members. The lack of data for the more recent documents in the matrix 
CITED prevents general comparisons with matrix CITNG, which has no 
corresponding restriction. 


VI. ATTEMPTS TO CLUSTER DOCUMENTS WITH CITATION DATA 
Michael Lesk 

Since citation data alone are probably not capable of producing 
detailed, accurate, and reliable index terms for documents, an attempt 
was made to determine whether citation data could be used for placing 
documents within general groupings, each group containing documents of 
similar content. The need to place individual documents into subgroups 
of a large collection may arise in two distinct situations: 

(1) An already subdivided collection exists, and it 
is desired to place newly acquired documents into 
their proper groups; 

(2) Groupings are to be produced from a collection 
which has not been previously divided. 

Problem (1) will be considered first. Here we may assume, for 
any given document, that the groupings for all other documents are known. 
The citation data used to study this first problem consisted of the row 
correlation matrix CITED. A manual division of the collection into 
nine groups, of three to nine documents each, was made. These groupings 
were made solely on the basis of content. Seventeen documents which did 
not fit into groups were omitted. The grouping is given in Appendix A. 

An attempt was then made to plaqe, each document into its correct 
group, using the citation data for that document and the known grouping 
of all other documents. This was done by considering the three highest 
row correlations of the given document in CITED, and the related documents 







VI-2 


associated with these numbers. If two of these related documents belonged 
to the same group, the given document was assigned to that group. If the 
three related documents belonged to three different groups, the group 
containing the document with the highest row correlation was chosen as the 
cluster to which the given document was to be assigned. This procedure 
was applied to the forty-one documents which were previously grouped by 
hand and which had citations in CITED (open). Of these, the correct group 
was selected in all but eight cases. Two of these eight cases involved 
articles in Report NSF-6, for which CITED does not give sufficient data.
Two more errors involved a single article which was cited more than any
other article and may have suffered from irrelevant citations. The 
elimination of these three articles with either too few or too many 
citations leaves four errors in thirty-seven documents, or a success rate
of 89.2%.
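
The assignment rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name assign_group, the dictionary layout of the row correlations, and the sample values are assumptions made for the example and are not taken from the report. It also assumes that only documents already placed in a group are considered as candidates.

```python
from collections import Counter


def assign_group(doc, row_corr, group_of, k=3):
    """Assign `doc` to a group using its k highest row correlations.

    row_corr: dict mapping a document to {other_document: row correlation}
    group_of: dict mapping each already grouped document to its group label
    (Both layouts are assumptions made for this sketch.)
    """
    # Rank the grouped documents by their correlation with `doc`, highest first.
    ranked = sorted(
        ((c, other) for other, c in row_corr[doc].items()
         if other != doc and other in group_of),
        reverse=True,
    )[:k]
    if not ranked:
        return None  # no citation data: the document cannot be placed

    # If two of the related documents share a group, choose that group.
    votes = Counter(group_of[other] for _, other in ranked)
    label, count = votes.most_common(1)[0]
    if count >= 2:
        return label

    # Otherwise choose the group of the most highly correlated document.
    return group_of[ranked[0][1]]


# Hypothetical data, for illustration only.
group_of = {"A1": "g1", "A2": "g1", "B1": "g2", "C1": "g3"}
row_corr = {
    "X": {"A1": 0.9, "B1": 0.8, "A2": 0.7, "C1": 0.1},
    "Y": {"B1": 0.9, "A1": 0.5, "C1": 0.4},
}
group_x = assign_group("X", row_corr, group_of)  # two of three neighbours in g1
group_y = assign_group("Y", row_corr, group_of)  # three different groups
```

Here document X is placed in g1 because two of its three strongest neighbours belong to that group, while Y falls back on the group of its single strongest neighbour.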

A direct division of the collection into groups was also studied, 
where only citation data were used to derive the clustering. The method 
employed was suggested by Mr. Edward Sussenguth. We may consider any two 
documents as linked by a strength proportional to the size of the row 
correlation between the two documents in a citation matrix. Let us define 
as major links those with row correlations exceeding 0.7. If a document 
has at least one citation link with a coefficient exceeding 0.7, we may 
define a group containing it as the set of all documents joined to the given 
document through a chain of links all greater than 0.7 in strength. This 
produces several clusters in a typical citation matrix, but leaves documents
without links of 0.7 or greater ungrouped. These documents may be grouped
by assigning them to that cluster which contains a document that is linked
to the given document by a strength of 0.6 or greater, if any such
cluster exists. If the given document has no links of that strength
(including all row correlations more than 0.6), it may be viewed as an
extraneous document, or links of strength 0.5 can be studied. Clearly,
the constants 0.6 and 0.7 in this method are arbitrary and can be
raised or lowered as demanded by the individual data.
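
A minimal Python sketch of this two-threshold procedure follows. The function name cluster_by_links, the frozenset-keyed correlation table, and the sample values are assumptions for illustration only. Two further choices are also assumptions, since the text leaves them open: the row correlation is treated as symmetric, and an ungrouped document that is linked to several clusters is attached to the one holding its strongest link.

```python
def cluster_by_links(docs, corr, major=0.7, minor=0.6):
    """Cluster documents by chains of strong links, then attach stragglers.

    corr: dict mapping frozenset({a, b}) to the row correlation of a and b
    (a symmetric layout assumed for this sketch).
    """
    def strength(a, b):
        return corr.get(frozenset((a, b)), 0.0)

    # Step 1: a cluster is the set of documents joined through a chain of
    # links that all exceed `major`.
    cluster_of, clusters = {}, []
    for start in docs:
        if start in cluster_of:
            continue
        component, stack = {start}, [start]
        while stack:
            a = stack.pop()
            for b in docs:
                if b not in component and strength(a, b) > major:
                    component.add(b)
                    stack.append(b)
        if len(component) > 1:  # documents without major links stay ungrouped
            for d in component:
                cluster_of[d] = len(clusters)
            clusters.append(component)

    # Step 2: attach each ungrouped document to the cluster containing its
    # strongest link, provided that link has strength `minor` or greater.
    extraneous = []
    for d in docs:
        if d in cluster_of:
            continue
        best = max(cluster_of, key=lambda x: strength(d, x), default=None)
        if best is not None and strength(d, best) >= minor:
            clusters[cluster_of[best]].add(d)
        else:
            extraneous.append(d)
    return clusters, extraneous


# Hypothetical correlations, for illustration only.
docs = ["a", "b", "c", "d", "e"]
corr = {
    frozenset(("a", "b")): 0.9,
    frozenset(("b", "c")): 0.8,
    frozenset(("c", "d")): 0.65,
    frozenset(("d", "e")): 0.3,
}
clusters, extraneous = cluster_by_links(docs, corr)
```

In this example a, b, and c form one cluster through links above 0.7, d is attached through its 0.65 link to c, and e is left as an extraneous document. Lowering the minor argument corresponds to studying weaker links, as suggested in the text.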

The citation matrices CITED (open) and CNG 2 (closed) were
clustered by this procedure. TDCMP was also clustered in the same way
to compare it with the hand division made earlier. The rough results
are shown in Appendices B, C, and D. Generally, the groupings do not
accurately reproduce the hand division. TDCMP comes closest, as would
be expected. CNG 2 is the worst, despite the fact that the cross
correlation of CNG 2 with TDCMP is higher than that of CITED with TDCMP.
Roughly, of the nine groups in the hand division, six are retained more
or less intact in the TDCMP clustering, about three in CITED, and one
in CNG 2. For example, the group of documents on postediting and the
"trial translator," which consists of GI209, MA301, and J0314, is
retained in the TDCMP clustering (with the addition of GB604, which does
not belong in this group), is still retained in the CITED clustering,
but is split over three groups in the CNG 2 clustering. The continuous
dictionary run articles, J0201, J0316, J0413, and J0607, are still
clustered in the TDCMP grouping, but are split up in the CITED and
CNG 2 groupings. Other rearrangements are also apparent from the
clusterings given in the Appendices.




More sophisticated methods of clustering the collection from the
citation data have been proposed, but it is doubtful that really substantial
improvement is to be expected. For example, the links between B0408, GR202,
and GR204 in CNG 2 are all of maximum strength (all row correlations 1.0000),
and yet the documents should not be in the same group. It would seem that
some form of additional information will be needed to succeed.

APPENDIX A
MANUAL CLUSTERING

(1) GI209 - MA301 - J0314 (postediting)

(2) J0201 - J0316 - J0413 - J0607 (continuous dictionary run)

(3) B0407 - B0408 - GB604 (predictive syntactic analysis)

(4) F0603 - WA410 - WA412 - F0409 - GR313 - IS411 (English inflection)

(5) FR304 - FR205 - VS308 (editing programs for syntactic study)

(6) GI491 - GI495 - MJ505 - C0403 - MJ306 - MJ318 - PL315 - MG404 - MJ203
(dictionary compilation)

(7) C0208 - F0415 - IS312 - J0609 - LE606 (dictionary correction)

(8) SH206 - FR309 - GE302 - FR406 - FR612 - FR405 - C0317 - FR303
(Russian syntactic analysis)

(9) MG493 - F0496 - F0506 - MJ494 (Russian inflection)






APPENDIX B
TDCMP CLUSTERING

(1) GI209 - MA301 - J0314 - GB604

(2) J0201 - J0316 - J0413 - J0607

(3) B0407 - B0408

(4) WA102 - WA410 - F0409 - IS411 - GR313 - F0496 - F0506 - SH206 - J0310

(5) C0208 - F0415 - IS312 - FR304 - FR205 - VS308 - LE606 - J0609 - GI495

(6) MG404 - C0403 - PL315 - MJ505 - MJ494

(7) MJ203 - MJ318 - MJ306 - FR303

(8) C0317 - FR405 - FR612 - FR406 - VC307

(9) R0207 - C0402

APPENDIX C
CITED CLUSTERING

(1) GI209 - MA301 - J0314

(2) B0407 - B0408 - J0607

(3) WA410 - IS411 - F0409 - J0316 - J0413 - FR612 - J0611 - FR405 - FR406

(4) C0403 - MJ203 - MJ318 - MG404 - MJ306 - VS308 - IS312 - FR304 - FR205

(5) F0496 - F0506 - MJ494 - MG493 - GI491 - GI495

(6) LY602 - LY414 - IS311

APPENDIX D
CNG 2 CLUSTERING

(1) MA301 - GB604

(2) J0314 - J0316 - B0319 - GE302 - FR205 - SA401

(3) GI209 - C0208 - SH206 - J0201 - F0496 - MJ505 - MJ203

(4) B0407 - B0408 - GR202 - GR204

(5) FR304 - IS311 - F0415

(6) FR405 - FR309 - FR612 - C0317 - FR303 - PL315 - C0403 - VC307 -
VS308 - J0413 - C0305 - J0310

(7) LY602 - J0609